We consider the structure learning problem for graphical models
that we call loosely connected Markov random fields, in which
the number of short paths between any pair of nodes is small. We 
point out that many previously studied models are examples of this 
family. However, due to the existence of short cycles, some previ- 
ous methods fail to detect all the edges in some of these graphical 
models. We present a new algorithm for learning the structure of 
loosely connected Markov random fields from i.i.d. samples. The key 
step for the algorithm is a max-min conditional independence test, 
in which the maximization step is to detect the edges while the min- 
imization step is to detect non-edges. The minimization step is used 
in several previous works. The maximization step has been added 
to explicitly break the short cycles that can cause problems in edge 
detection. We show that, under certain non-degeneracy conditions, 
our algorithm learns the graph correctly with high probability using 
$n = O(\log p)$ samples, where $p$ is the size of the graph. For models
with at most $D_1$ short paths between non-neighbor nodes and $D_2$
non-direct paths between neighboring nodes, the running time of our
algorithm is $O(np^{D_1+D_2+2})$. If in addition the Markov random field
has correlation decay and satisfies a pairwise non-degeneracy condi- 
tion, an extended algorithm can be applied and the running time is 
further reduced to $O(np^2)$ with a preprocessing step. If we know that
the MRF is a ferromagnetic Ising model, we can remove the maximization
step in the algorithm, which gives running time $O(np^{D_1+2})$,
and the extended algorithm can be applied. In several special cases 
of loosely connected Markov random fields, our algorithm achieves 
the same or lower computational complexity than the previously de- 
signed algorithms for individual cases. We also get new results for 
more general graphical models; in particular, our algorithm learns
general Ising models on the Erdős-Rényi random graph $\mathcal{G}(p, c/p)$
correctly with running time $O(np^5)$.
1. Introduction. In many models of networks, such as social networks
and gene regulatory networks, each node in the network represents a ran- 
dom variable and the graph encodes the conditional independence relations 
among the random variables. A Markov random field is a particular such 
representation which has applications in a variety of areas (see [3] and the 
references therein). In a Markov random field, the lack of an edge between 
two nodes implies that the two random variables are independent, condi- 
tioned on all the other random variables in the network. 

Structure learning, i.e., learning the underlying graph structure of a Markov
random field, refers to the problem of determining if there is an edge be- 
tween each pair of nodes, given i.i.d. samples from the joint distribution of 
the random vector. As a concrete example of structure learning, consider 
a social network in which only the participants' actions are observed. In 
particular, we do not observe, or are unable to observe, interactions between
the participants. Our goal is to infer relationships among the nodes (par- 
ticipants) in such a network by understanding the correlations among the 
nodes. The canonical example used to illustrate such inference problems is 
the US Senate [4]. Suppose one has access to the voting patterns of the 
senators over a number of bills (and not their party affiliations or any other 
information). The question we would like to answer is the following: can we
say that a particular senator's vote is independent of everyone else's when 
conditioned on a few other senators' votes? In other words, if we view the 
senators' actions as forming a Markov Random Field (MRF), we want to 
infer the topology of the underlying graph. 

In general, learning high dimensional densely connected graphical models 
requires a large number of samples, and is usually computationally intractable.
In this paper, we focus on a more tractable family which we call loosely con- 
nected MRFs. Roughly speaking, a Markov random field is loosely connected 
if the number of short paths between any pair of nodes is small. We show 
that many previously studied models are examples of this family. In fact, 
as densely connected graphical models are difficult to learn, some sparsity
assumptions are necessary to make the learning problem tractable. Common
assumptions include an upper bound on the node degree of the underlying 
graph [6, 14], restrictions on the class of parameters of the joint probability 
distribution of the random variables to ensure correlation decay [6, 14, 2], 
lower bounds on the girth of the underlying graph [14], and a sparse, probabilistic
structure on the underlying random graph [2]. In all these cases, the
resulting MRFs turn out to be loosely connected. In this sense, our definition
here provides a unified view of the assumptions in previous works. 

However, loosely connected MRFs are not always easy to learn. Due to the 
existence of short cycles, the dependence over an edge connecting a pair of
neighboring nodes can be approximately cancelled by some short non-direct 
paths between them, in which case correctly detecting this edge is difficult, 
as shown in the following example. This example is perhaps well-known, but 
we present it here to motivate our algorithm presented later. 

Example 1.1. Consider three binary random variables $X_i \in \{0, 1\}$, $i = 1, 2, 3$.
Assume $X_1, X_2$ are independent Bernoulli($\frac{1}{2}$) random variables and
$X_3 = X_1 \oplus X_2$ with probability 0.9, where $\oplus$ means exclusive or. We note
that this joint distribution is symmetric, i.e., we get the same distribution if
we assume that $X_2, X_3$ are independent Bernoulli($\frac{1}{2}$) and $X_1 = X_2 \oplus X_3$ with
probability 0.9. Therefore, the underlying graph is a triangle. However, it is
not hard to see that any two of the three random variables are marginally independent.
For this simple example, previous methods in [14, 3] fail to learn the true
graph. □
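
The cancellation in Example 1.1 can be checked numerically. The following is a minimal sketch, assuming the Bernoulli parameter is 1/2 and using 0-based positions (so $X_1, X_2, X_3$ sit at indices 0, 1, 2). The helper names `example_pmf`, `marginal`, and `cond_mutual_info` are illustrative and do not come from the paper.

```python
import itertools
import math

def example_pmf():
    """Exact joint pmf of (X1, X2, X3): X1, X2 fair bits, X3 = X1 xor X2 w.p. 0.9."""
    pmf = {}
    for x1, x2, x3 in itertools.product((0, 1), repeat=3):
        agree = 0.9 if x3 == (x1 ^ x2) else 0.1
        pmf[(x1, x2, x3)] = 0.25 * agree
    return pmf

def marginal(pmf, idx):
    """Marginal pmf of the coordinates listed in idx (idx may be empty)."""
    out = {}
    for x, pr in pmf.items():
        key = tuple(x[k] for k in idx)
        out[key] = out.get(key, 0.0) + pr
    return out

def cond_mutual_info(pmf, i, j, cond=()):
    """I(X_i; X_j | X_cond) in nats, computed from an exact pmf."""
    cond = tuple(cond)
    p_ijz = marginal(pmf, (i, j) + cond)
    p_iz = marginal(pmf, (i,) + cond)
    p_jz = marginal(pmf, (j,) + cond)
    p_z = marginal(pmf, cond)
    total = 0.0
    for key, pr in p_ijz.items():
        if pr > 0:
            xi, xj, z = key[0], key[1], key[2:]
            total += pr * math.log(pr * p_z[z] / (p_iz[(xi,) + z] * p_jz[(xj,) + z]))
    return total

pmf = example_pmf()
unconditional = cond_mutual_info(pmf, 0, 2)        # I(X1; X3)
conditional = cond_mutual_info(pmf, 0, 2, (1,))    # I(X1; X3 | X2)
```

Under these assumptions `unconditional` is zero up to rounding, while `conditional` is roughly 0.37 nats, which is the dependence that conditioning on $X_2$ exposes.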

We propose a new algorithm that correctly learns the graphs for loosely 
connected MRFs. For each node, the algorithm loops over all the other nodes 
to determine if they are neighbors of this node. The key step in the algorithm 
is a max-min conditional independence test, in which the maximization step 
is designed to detect the edges while the minimization step is designed to 
detect non-edges. The minimization step is used in several previous works 
such as [2, 3]. The maximization step has been added to explicitly break the 
short cycles that can cause problems in edge detection. If the direct edge 
is the only short path between a pair of neighboring nodes, the dependence over
the edge can be detected by a simple independence test. When there are 
other short paths between a pair of neighboring nodes, we first find a set 
of nodes that separates all the non-direct paths between them, i.e., after 
removing this set of nodes from the graph, the direct edge is the only short 
path connecting the two nodes. Then the dependence over the edge can again
be detected by a conditional independence test where the conditioned set is
the set above. In Example 1.1, $X_1$ and $X_3$ are unconditionally independent
as the dependence over edge (1, 3) is canceled by the other path (1, 2, 3). If
we break the cycle by conditioning on $X_2$, $X_1$ and $X_3$ become dependent,
so our algorithm is able to detect the edges correctly. As the size of the 
conditioned sets is small for loosely connected MRFs, our algorithm has low 
complexity. In particular, for models with at most $D_1$ short paths between
non-neighbor nodes and $D_2$ non-direct paths between neighboring nodes,
the running time of our algorithm is $O(np^{D_1+D_2+2})$.

If the MRF satisfies a pairwise non-degeneracy condition, i.e., the cor- 
relation between any pair of neighboring nodes is lower bounded by some 
constant, then we can extend the basic algorithm to incorporate a correlation
test as a preprocessing step. For each node, the correlation test adds
those nodes whose correlation with the current node is above a threshold to a 
candidate neighbor set, which is then used as the search space for the more 
computationally expensive max-min conditional independence test. If the 
MRF has fast correlation decay, the size of the candidate neighbor set can 
be greatly reduced, so we can achieve much lower computational complexity 
with this extended algorithm. 

When applying our algorithm to Ising models, we get lower computational
complexity for a ferromagnetic Ising model than for a general one on the
same graph. Intuitively, the edge coefficient $J_{ij} > 0$ means that $i$ and $j$ are
positively dependent. For any path between $i$ and $j$, as all the edge coefficients
are positive, the dependence over the path is also positive. Therefore, the
non-direct paths between a pair of neighboring nodes $i, j$ make $X_i$ and $X_j$,
which are positively dependent over the edge $(i, j)$, even more positively
dependent. Therefore, we do not need the maximization step which breaks
the short cycles, and the resulting algorithm has running time $O(np^{D_1+2})$.
In addition, the pairwise non-degeneracy condition is automatically satisfied
and the extended algorithm can be applied.

1.1. Relation to Prior Work. We focus on computational complexity 
rather than sample complexity in comparing our algorithm with previous 
algorithms. In fact, it has been shown that $\Omega(\log p)$ samples are required to
learn the graph correctly with high probability, where $p$ is the size of the
graph [18]. For all the previously known algorithms for which analytical
complexity bounds are available, the number of samples required to recover the
graph correctly with high probability, i.e., the sample complexity, is $O(\log p)$.
Not surprisingly, the sample complexity for our algorithm is also $O(\log p)$
under reasonable assumptions. 

Our algorithm with the probability test reproduces the algorithm in [6, 
Theorem 3] for MRFs on bounded degree graphs. Our algorithm is more flex- 
ible and achieves lower computational complexity for MRFs that are loosely 
connected but have a large maximum degree. In particular, [14] proposed a 
low complexity greedy algorithm that is correct when the MRF has corre- 
lation decay and the graph has large girth. We show that under the same 
assumptions, we can first perform a simple correlation test and reduce the 
search space for neighbors from all the nodes to a constant size candidate 
neighbor set. With this preprocessing step, our algorithm and the algorithms 
in [6, 14, 17] have computational complexity $O(np^2)$, which is lower than
what we would get by only applying the greedy algorithm [14]. The more 
recent work [17] improves over [14] by proposing two new greedy algorithms
that are correct for learning small girth graphs. However, [17] assumes a 
constant size candidate neighbor set as input, which might not be easy to 
get in general. In fact, for MRFs with bad short cycles as in Example 1.1, 
learning a candidate neighbor set can be as difficult as directly learning the 
neighbor set. 

Our analysis of the class of Ising models on sparse Erdős-Rényi random
graphs $\mathcal{G}(p, c/p)$ was motivated by the results in [2], which studies the
special case of the so-called ferromagnetic Ising models defined over an
Erdős-Rényi random graph. The computational complexity of the algorithm in [2]
is $O(np^4)$. In this case, the key step of our algorithm reduces to the
algorithm in [2]. But we show that, under the ferromagnetic assumption, we can
again perform a correlation test to reduce the search space for neighbors,
and the total computational complexity for our algorithm is $O(np^2)$.

The work [3] extends the results in [2] to general Ising models and more
general sparse graphs (beyond the Erdős-Rényi model). We note that the
tractable graph families in [3] are similar to our notion of loosely connected
MRFs. For general Ising models over sparse Erdős-Rényi random graphs,
our algorithm has computational complexity $O(np^5)$ while the algorithm in
[3] has computational complexity $O(np^4)$. The difference comes from the fact
that our algorithm has an additional maximization step to break bad short 
cycles as in Example 1.1. Without this maximization step, the algorithm in 
[3] fails for this example. The performance analysis in [3] explicitly excludes 
such difficult cases by noting that these "unfaithful" parameter values have 
Lebesgue measure zero [3, Section B.3.2]. However, when the Ising model 
parameters lie close to this Lebesgue measure zero set, the learning problem 
is still ill posed for the algorithm in [3], i.e., the sample complexity required 
to recover the graph correctly with high probability depends on how close the 
parameters are to this set, which is not the case for our algorithm. In fact, 
the same problem with the argument that the unfaithful set is of Lebesgue 
measure zero has been observed for causal inference in the Gaussian case [19]. 
It has been shown in [19] that a stronger notion of faithfulness is required 
to get uniform sample complexity results, and the set that is not strongly 
faithful has non-zero Lebesgue measure and can be surprisingly large.

Another way to learn the structures of MRFs is by solving $\ell_1$-regularized
convex optimizations under a set of incoherent conditions [16]. It is shown in 
[12] that, for some Ising models on a bounded degree graph, the incoherent 
conditions hold when the Ising model is in the correlation decay regime. But 
the incoherent conditions do not have a clear interpretation as conditions 
for the graph parameters in general and are NP-hard to verify for a given 
Ising model [12]. As there is no analytical result about the computational
complexity of the solver that is used to solve the convex optimization, it is 
not clear how to compare the computational complexity of our algorithm 
with the one in [16]. 

We note that the recent development of directed information graphs [15] 
is closely related to the theory of MRFs. Learning a directed information 
graph, i.e., finding the causal parents of each random process, is essentially 
the same as finding the neighbors of each random variable in learning an
MRF. Therefore, our algorithm for learning the MRFs can potentially be
used to learn the directed information graphs as well. 

The paper is organized as follows. We present some preliminaries in the 
next section. In Section 3, we define loosely-connected MRFs and show that 
several previously studied models are examples of this family. In Section 4, 
we present our algorithm and show the conditions required to correctly re- 
cover the graph. We also provide the concentration results in this section. In 
Section 5, we apply our algorithm to the general Ising models studied in Sec- 
tion 3 and evaluate its sample complexity and computational complexity in 
each case. In Section 6, we show that our algorithm achieves even lower com- 
putational complexity when the Ising model is ferromagnetic. Experimental 
results are presented in Section 7. 

2. Preliminaries. 

2.1. Markov Random Fields (MRFs). Let $X = (X_1, X_2, \ldots, X_p)$ be a
random vector with distribution $P$ and $G = (V, E)$ be an undirected graph
consisting of $|V| = p$ nodes with each node $i$ associated with the $i$-th element
$X_i$ of $X$. Before we define an MRF, we introduce the notation $X_S$ to denote
the subset of the random variables in $X$ indexed by a set $S \subseteq V$. A random
vector and graph pair $(X, G)$ is called an MRF if it satisfies one of the
following three Markov properties:

1. Pairwise Markov: $X_i \perp X_j \mid X_{V \setminus \{i,j\}}$, $\forall (i,j) \notin E$, where $\perp$ denotes
independence.

2. Local Markov: $X_i \perp X_{V \setminus (\{i\} \cup N_i)} \mid X_{N_i}$, $\forall i \in V$, where $N_i$ is the set of
neighbors of node $i$.

3. Global Markov: $X_A \perp X_B \mid X_S$, if $S$ separates $A, B$ on $G$. In this case,
we say $G$ is an I-map of $X$. Further, if $G$ is an I-map of $X$ and the
global Markov property does not hold if any edge of $G$ is removed,
then $G$ is called a minimal I-map of $X$.

In all three cases, G encodes a subset of the conditional independence rela- 
tions of X and we say that X is Markov with respect to G. We note that 
the global Markov property implies the local Markov property, which in turn
implies the pairwise Markov property. 

When $P(x) > 0$ for all $x$, the three Markov properties are equivalent, i.e., if
there exists a $G$ under which one of the Markov properties is satisfied, then
the other two are also satisfied. Further, in the case when $P(x) > 0$ for all $x$, there
exists a unique minimal I-map of $X$. The unique minimal I-map $G = (V, E)$
is constructed as follows:

1. Each random variable $X_i$ is associated with a node $i \in V$.

2. $(i, j) \notin E$ if and only if $X_i \perp X_j \mid X_{V \setminus \{i,j\}}$.

In this paper, we consider the case $P(x) > 0$ for all $x$ and are interested in
learning the structure of the associated unique minimal I-map. We will also
assume that, for each $i$, $X_i$ takes on values in a discrete, finite set $\mathcal{X}$. We
will also be interested in the special case where the MRF is an Ising model,
which we describe next.

2.2. Ising Model. Ising models are a type of well-studied pairwise Markov
random fields. In an Ising model, each random variable $X_i$ takes values in the
set $\mathcal{X} = \{-1, +1\}$ and the joint distribution is parameterized by constants
called edge coefficients $J$ and external fields $h$:

$$P(x) = \frac{1}{Z} \exp\Big( \sum_{(i,j) \in E} J_{ij} x_i x_j + \sum_{i \in V} h_i x_i \Big),$$

where $Z$ is a normalization constant to make $P(x)$ a probability distribution.
If $h = 0$, we say the Ising model is zero-field. If $J_{ij} > 0$, we say the Ising
model is ferromagnetic.

Ising models have the following useful property. Given an Ising model, the
conditional probability $P(X_{V \setminus S} \mid x_S)$ corresponds to an Ising model on $V \setminus S$
with edge coefficients $J_{ij}$, $i, j \in V \setminus S$, unchanged and modified external fields
$h_i + h_i'$, $i \in V \setminus S$, where $h_i' = \sum_{(i,j) \in E,\, j \in S} J_{ij} x_j$ is the additional external
field on node $i$ induced by fixing $X_S = x_S$.
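
The conditioning property can be verified by brute force on a small graph. The sketch below assumes the parameterization $P(x) \propto \exp(\sum_{(i,j) \in E} J_{ij} x_i x_j + \sum_i h_i x_i)$; the four-node cycle, its coefficients, and the function names are hypothetical choices made only for illustration.

```python
import itertools
import math

def ising_pmf(nodes, J, h):
    """Brute-force Ising pmf over {-1,+1}^|nodes|.

    J maps an edge (i, j) to its coefficient; h maps a node to its external field.
    """
    weights = {}
    for values in itertools.product((-1, 1), repeat=len(nodes)):
        x = dict(zip(nodes, values))
        exponent = sum(c * x[i] * x[j] for (i, j), c in J.items())
        exponent += sum(h[i] * x[i] for i in nodes)
        weights[values] = math.exp(exponent)
    z = sum(weights.values())  # normalization constant Z
    return {values: w / z for values, w in weights.items()}

def conditional_pmf(pmf, nodes, fixed):
    """Distribution of the unfixed nodes given X_S = x_S, with fixed = {node: value}."""
    keep = [k for k, v in enumerate(nodes) if v not in fixed]
    consistent = {
        values: pr for values, pr in pmf.items()
        if all(values[nodes.index(s)] == val for s, val in fixed.items())
    }
    total = sum(consistent.values())
    return {tuple(values[k] for k in keep): pr / total
            for values, pr in consistent.items()}

def reduced_model(nodes, J, h, fixed):
    """Parameters on V minus S: same couplings, fields shifted by sum_{j in S} J_ij x_j."""
    rest = [v for v in nodes if v not in fixed]
    new_h = {v: h[v] for v in rest}
    new_J = {}
    for (i, j), c in J.items():
        if i in fixed and j in fixed:
            continue  # constant factor, absorbed by the normalization
        if j in fixed:
            new_h[i] += c * fixed[j]
        elif i in fixed:
            new_h[j] += c * fixed[i]
        else:
            new_J[(i, j)] = c
    return rest, new_J, new_h

# Hypothetical 4-cycle with mixed-sign couplings, conditioning on X_3 = +1.
nodes = [0, 1, 2, 3]
J = {(0, 1): 0.5, (1, 2): -0.3, (2, 3): 0.8, (0, 3): 0.2}
h = {0: 0.1, 1: -0.2, 2: 0.0, 3: 0.3}
fixed = {3: 1}
direct = conditional_pmf(ising_pmf(nodes, J, h), nodes, fixed)
rest, J_rest, h_rest = reduced_model(nodes, J, h, fixed)
reduced = ising_pmf(rest, J_rest, h_rest)
max_gap = max(abs(direct[x] - reduced[x]) for x in direct)
```

Here `max_gap` compares the conditional distribution computed directly with the Ising model built from the shifted fields; the two should agree up to floating-point error.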

2.3. Random Graphs. A random graph is a graph generated from a prior
distribution over the set of all possible graphs with a given number of nodes.
Let $X_p$ be a function on graphs with $p$ nodes and let $C$ be a constant. We
say $X_p \ge C$ almost always for a family of random graphs indexed by $p$ if
$P(X_p \ge C) \to 1$ as $p \to \infty$. Similarly, we say $X_p \to C$ almost always for a
family of random graphs if $\forall \epsilon > 0$, $P(|X_p - C| \le \epsilon) \to 1$ as $p \to \infty$. This is
a slight variation of the definition of almost always in [1].

The Erdős-Rényi random graph $\mathcal{G}(p, c/p)$ is a graph on $p$ nodes in which the
probability of an edge being in the graph is $c/p$ and the edges are generated
independently. We note that, in this random graph, the average degree of a
node is $c$. In this paper, when we consider random graphs, we only consider
the Erdős-Rényi random graph $\mathcal{G}(p, c/p)$.
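
As an illustration of this model, the sketch below samples one graph with independent edges of probability $c/p$ and reports the average and maximum degree. The function names, the seed, and the values $p = 1000$, $c = 3$ are arbitrary assumptions; the expected average degree is $c(p-1)/p$, which is close to $c$ for large $p$.

```python
import random

def sample_sparse_random_graph(p, c, rng):
    """Adjacency sets of a graph on p nodes; each pair is an edge independently w.p. c/p."""
    edge_prob = c / p
    neighbors = {v: set() for v in range(p)}
    for i in range(p):
        for j in range(i + 1, p):
            if rng.random() < edge_prob:
                neighbors[i].add(j)
                neighbors[j].add(i)
    return neighbors

def degree_summary(neighbors):
    """Return (average degree, maximum degree)."""
    degrees = [len(nb) for nb in neighbors.values()]
    return sum(degrees) / len(degrees), max(degrees)

rng = random.Random(0)  # fixed seed, chosen arbitrarily for reproducibility
graph = sample_sparse_random_graph(p=1000, c=3.0, rng=rng)
average_degree, maximum_degree = degree_summary(graph)
```

In a typical draw the maximum degree is noticeably larger than the average degree, which is why bounded-degree arguments do not transfer directly to this family.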

2.4. High-Dimensional Structure Learning. In this paper, we are interested
in inferring the structure of the graph $G$ associated with an MRF
$(X, G)$. We will assume that $P(x) > 0$ for all $x$, and $G$ will refer to the
corresponding unique minimal I-map. The goal of structure learning is to design
an algorithm that, given $n$ i.i.d. samples $\{X^{(i)}\}_{i=1}^{n}$ from the distribution
$P$, outputs an estimate $\hat{G}$ which equals $G$ with high probability when $n$ is
large. We say that two graphs are equal when their node and edge sets are
identical.

In the classical setting, the accuracy of estimating G is considered only 
when the sample size n goes to infinity while the random vector dimension p 
is held fixed. This setting is restrictive for many contemporary applications, 
where the problem size p is much larger than the number of samples. A more 
suitable assumption allows both n and p to become large, with n growing at 
a slower rate than p. In such a case, the structure learning problem is said 
to be high-dimensional. 

An algorithm for structure learning is evaluated both by its computational
complexity and sample complexity. The computational complexity refers to
the number of computations required to execute the algorithm, as a function
of $n$ and $p$. When $G$ is a deterministic graph, we say the algorithm has
sample complexity $f(p)$ if, for $n = O(f(p))$, there exist constants $c$ and
$\alpha > 0$, independent of $p$, such that $\Pr(\hat{G} = G) \ge 1 - \frac{c}{p^{\alpha}}$ for all $P$ which
are Markov with respect to $G$. When $G$ is a random graph drawn from
some prior distribution, we say the algorithm has sample complexity $f(p)$ if
the above is true almost always. In the high-dimensional setting $n$ is much
smaller than $p$. In fact, we will show that, for the algorithms described in
this paper, $f(p) = \log p$.

3. Loosely Connected MRFs. Loosely connected Markov random
fields are undirected graphical models in which the number of short paths
between any pair of nodes is small. Roughly speaking, a path between two
nodes is short if the dependence between the two nodes is non-negligible even if
all other paths between the nodes are removed. Later, we will more precisely
quantify the term "short" in terms of the correlation decay property of the
MRF. For simplicity, we say that a set $S$ separates some paths between nodes
$i$ and $j$ if removing $S$ disconnects these paths. In such a graphical model, if
$i, j$ are not neighbors, there is a small set of nodes $S$ separating all the short
paths between them, and conditioned on this set of variables $X_S$ the two
variables $X_i$ and $X_j$ are approximately independent. On the other hand, if
$i, j$ are neighbors, there is a small set of nodes $T$ separating all the short
non-direct paths between them, i.e., the direct edge is the only short path
connecting the two nodes after removing $T$ from the graph. Conditioned
on this set of variables $X_T$, the dependence of $X_i$ and $X_j$ is dominated by
the dependence over the direct edge and hence is bounded away from zero. The
following necessary and sufficient condition for the non-existence of an edge
in a graphical model, which we have not seen in prior work, shows that both
the sets $S$ and $T$ above are essential for learning the graph.

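Lemma 3.1. Consider two nodes $i$ and $j$ in $G$. Then, $(i,j) \notin E$ if and
only if $\exists S$, $\forall T$, $X_i \perp X_j \mid X_S, X_T$.

Proof. Recall from the definition of the minimal I-map that $(i,j) \notin E$
if and only if $X_i \perp X_j \mid X_{V \setminus \{i,j\}}$. Therefore, the statement of the lemma is
equivalent to

$$I(X_i; X_j \mid X_{V \setminus \{i,j\}}) = 0 \iff \min_{S} \max_{T} I(X_i; X_j \mid X_S, X_T) = 0,$$

where $I(X_i; X_j \mid X_S)$ denotes the mutual information between $X_i$ and $X_j$
conditioned on $X_S$, and we have used the fact that $X_i \perp X_j \mid X_S$ is equivalent
to $I(X_i; X_j \mid X_S) = 0$. Notice that

$$\min_{S} \max_{T} I(X_i; X_j \mid X_S, X_T) = \min_{S} \max_{T' \supseteq S} I(X_i; X_j \mid X_{T'})$$

and $\max_{T' \supseteq S} I(X_i; X_j \mid X_{T'})$ is a non-increasing function of $S$ with respect to
set inclusion, since enlarging $S$ shrinks the collection of supersets $T'$. The
minimization over $S$ is therefore achieved at $S = V \setminus \{i,j\}$, i.e.,

$$I(X_i; X_j \mid X_{V \setminus \{i,j\}}) = \min_{S} \max_{T} I(X_i; X_j \mid X_S, X_T).$$

□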

This lemma tells us that, if there is no edge between nodes $i$ and $j$, we
can find a set of nodes $S$ such that the removal of $S$ from the graph separates
$i$ and $j$. From the global Markov property, this implies that $X_i \perp X_j \mid X_S$.
However, as Example 1.1 shows, the converse is not true. In fact, for $S$ being
the empty set, i.e., $S = \emptyset$, we have $X_1 \perp X_2 \mid X_S$, but (1, 2) is indeed an edge
in the graph. The above lemma completes the statement in the converse
direction, showing that we should also introduce a set $T$ in addition to the
set $S$ to correctly identify the edge.

Motivated by this lemma, we define loosely connected MRFs as follows. 
Definition 3.2. We say an MRF is $(D_1, D_2, \epsilon)$-loosely connected if

1. for any $(i,j) \notin E$, $\exists S$ with $|S| \le D_1$, $\forall T$ with $|T| \le D_2$,

$$\Delta(X_i; X_j \mid X_S, X_T) \le \frac{\epsilon}{4},$$

2. for any $(i,j) \in E$, $\forall S$ with $|S| \le D_1$, $\exists T$ with $|T| \le D_2$,

$$\Delta(X_i; X_j \mid X_S, X_T) \ge \epsilon,$$

for some conditional independence test $\Delta$.


The conditional independence test $\Delta$ should satisfy $\Delta(X_i; X_j \mid X_S, X_T) = 0$
if and only if $X_i \perp X_j \mid X_S, X_T$. In this paper, we use two types of
conditional independence tests:

• Mutual Information Test:

$$\Delta(X_i; X_j \mid X_S, X_T) = I(X_i; X_j \mid X_S, X_T).$$

• Probability Test:

$$\Delta(X_i; X_j \mid X_S, X_T) = \max_{x_i, x_j, x_j', x_S, x_T} |P(x_i \mid x_j, x_S, x_T) - P(x_i \mid x_j', x_S, x_T)|.$$

Later on, we will see that the probability test gives lower sample complex- 
ity for learning Ising models on bounded degree graphs, while the mutual 
information test gives lower sample complexity for learning Ising models on 
graphs with unbounded degree. 
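
To make the probability test concrete, the following sketch evaluates it on an exact pmf; the mutual information test can be computed from the same marginals in an analogous way. It assumes a strictly positive pmf stored as a dictionary from value tuples to probabilities, and the three-node chain used as input is a hypothetical example rather than one from the paper.

```python
import itertools

def marginal(pmf, idx):
    """Marginal pmf of the coordinates listed in idx."""
    out = {}
    for x, pr in pmf.items():
        key = tuple(x[k] for k in idx)
        out[key] = out.get(key, 0.0) + pr
    return out

def probability_test(pmf, i, j, cond=()):
    """max over x_i, x_j, x_j', x_cond of |P(x_i | x_j, x_cond) - P(x_i | x_j', x_cond)|.

    Assumes every configuration has positive probability.
    """
    cond = tuple(cond)
    p_ijz = marginal(pmf, (i, j) + cond)
    p_jz = marginal(pmf, (j,) + cond)
    by_context = {}  # (x_i, x_cond) -> conditional probabilities over all x_j
    for key, pr in p_ijz.items():
        xi, xj, z = key[0], key[1], key[2:]
        by_context.setdefault((xi, z), []).append(pr / p_jz[(xj,) + z])
    # For a fixed context the largest pairwise gap is max minus min.
    return max(max(vals) - min(vals) for vals in by_context.values())

def chain_pmf(flip=0.1):
    """Hypothetical chain X0 - X1 - X2 on {0,1}: each node copies its parent w.p. 1 - flip."""
    pmf = {}
    for x0, x1, x2 in itertools.product((0, 1), repeat=3):
        pr = 0.5
        pr *= (1 - flip) if x1 == x0 else flip
        pr *= (1 - flip) if x2 == x1 else flip
        pmf[(x0, x1, x2)] = pr
    return pmf

pmf = chain_pmf()
edge_stat = probability_test(pmf, 0, 1)            # neighboring nodes
non_edge_stat = probability_test(pmf, 0, 2, (1,))  # nodes separated by node 1
```

For this chain the statistic is large on the edge (0, 1) and vanishes, up to rounding, for the pair (0, 2) once the separating node is conditioned on.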

Note that the above definition restricts the size of the sets S and T to make 
the learning problem tractable. We show in the rest of the section that several 
important Ising models are examples of loosely connected MRFs. Unless 
otherwise stated, we assume that the edge coefficients $J_{ij}$ are bounded, i.e.,

$$J_{\min} \le |J_{ij}| \le J_{\max}.$$

3.1. Bounded Degree Graph. We assume the graph has maximum degree
$d$. For any $(i,j) \notin E$, the set $S = N_i$ of size at most $d$ separates $i$ and $j$, and
for any set $T$ we have $\Delta(X_i; X_j \mid X_S, X_T) = 0$. For any $(i,j) \in E$, the set
$T = N_i \setminus \{j\}$ of size at most $d - 1$ separates all the non-direct paths between $i$
and $j$. Moreover, we have the following lower bound for neighbors from [6,
Proposition 2].
Proposition 3.3. When $i, j$ are neighbors and $T = N_i \setminus \{j\}$, there is a
choice of $x_i, x_j, x_j', x_S, x_T$ such that

$$|P(x_i \mid x_j, x_S, x_T) - P(x_i \mid x_j', x_S, x_T)| \ge \frac{\tanh(2 J_{\min})}{2 e^{2 J_{\max}} + 2 e^{-2 J_{\max}}} = \epsilon.$$

□



Therefore, the Ising model on a bounded degree graph with maximum 
degree $d$ is a $(d, d-1, \epsilon)$-loosely connected MRF. We note that here we do
not use any correlation decay property, and we view all the paths as short. 

3.2. Bounded Degree Graph, Correlation Decay and Large Girth. In this 
subsection, we still assume the graph has maximum degree d. From the pre- 
vious subsection, we already know that the Ising model is loosely connected. 
But we show that when the Ising model is in the correlation decay regime 
and further has large girth, it is a much sparser model than the general 
bounded degree case. 

Correlation decay is a property of MRFs which says that, for any pair of
nodes $i, j$, the correlation of $X_i$ and $X_j$ decays with the distance between $i, j$.
When an MRF has correlation decay, the correlation of $X_i$ and $X_j$ is mainly
determined by the short paths between the nodes, and the contribution from
the long paths is negligible. It is known that when $J_{\max}$ is small compared
with $d$, the Ising model has correlation decay. More specifically, we have
the following lemma, which is a consequence of the strong correlation decay 
property [21, Theorem 1]. 

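Lemma 3.4. Assume $(d-1) \tanh J_{\max} < 1$. For any $i, j \in V$ with $d(i,j) = l$,
any set $S$, and any $x_i, x_j, x_j', x_S$,

$$|P(x_i \mid x_j, x_S) - P(x_i \mid x_j', x_S)| \le 4 J_{\max} d\, [(d-1) \tanh J_{\max}]^{l-1} = \beta \alpha^{l},$$

where $\beta = \frac{4 J_{\max} d}{(d-1) \tanh J_{\max}}$ and $\alpha = (d-1) \tanh J_{\max}$.

Proof. For some given $x_i, x_j, x_j', x_S$, w.l.o.g. assume $P(x_i \mid x_j, x_S) \ge P(x_i \mid x_j', x_S)$.
Applying [21, Theorem 1] with $A = \{j\} \cup S$, we get

$$\begin{aligned}
|P(x_i \mid x_j, x_S) - P(x_i \mid x_j', x_S)| &\le 1 - \frac{P(x_i \mid x_j', x_S)}{P(x_i \mid x_j, x_S)} \\
&\le 1 - e^{-4 J_{\max} d\, [(d-1) \tanh J_{\max}]^{d(i,j)-1}} \\
&\le 4 J_{\max} d\, [(d-1) \tanh J_{\max}]^{d(i,j)-1}.
\end{aligned}$$

□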
This lemma implies that, in the correlation decay regime $(d-1) \tanh J_{\max} < 1$,
the Ising model has exponential correlation decay, i.e., the correlation
between a pair of nodes decays exponentially with their distance. We say that
a path of length $l$ is short if $\beta \alpha^{l}$ is above some desired threshold.
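
The notion of a short path can be made concrete with a small calculation. The sketch below assumes the bound takes the form $\beta \alpha^{l}$ with $\alpha = (d-1) \tanh J_{\max}$ and $\beta = 4 J_{\max} d / \alpha$, as read from Lemma 3.4; the function names and the numerical values are illustrative assumptions.

```python
import math

def correlation_decay_params(d, j_max):
    """Return (alpha, beta) with alpha = (d-1) tanh(J_max) and beta = 4 J_max d / alpha."""
    alpha = (d - 1) * math.tanh(j_max)
    if not 0.0 < alpha < 1.0:
        raise ValueError("requires 0 < (d - 1) * tanh(J_max) < 1")
    beta = 4.0 * j_max * d / alpha
    return alpha, beta

def decay_bound(d, j_max, length):
    """Upper bound beta * alpha**length on the influence across distance `length`."""
    alpha, beta = correlation_decay_params(d, j_max)
    return beta * alpha ** length

def longest_short_path(d, j_max, threshold):
    """Largest l >= 1 with beta * alpha**l >= threshold, or 0 if no such l exists."""
    if threshold <= 0:
        raise ValueError("threshold must be positive")
    alpha, beta = correlation_decay_params(d, j_max)
    length = 0
    # alpha < 1, so the bound shrinks geometrically and the loop terminates.
    while beta * alpha ** (length + 1) >= threshold:
        length += 1
    return length

# Hypothetical values: maximum degree 3, J_max = 0.2, threshold 0.05.
short_length = longest_short_path(d=3, j_max=0.2, threshold=0.05)
```

With these hypothetical values, paths of up to `short_length` hops count as short, and longer paths contribute less than the chosen threshold.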

The girth of a graph is defined as the length of the shortest cycle in the 
graph, and large girth implies that there is no short cycle in the graph. 
When the Ising model is in the correlation decay regime and the girth of the 
graph is large in terms of the correlation decay parameters, there is at most 
one short path between any pair of non-neighbor nodes, and no short paths 
other than the direct edge between any pair of neighboring nodes. Naturally, 
we can use S of size 1 to approximately separate any pair of non-neighbor 
nodes and do not need T to block the other paths for neighbor nodes as 
the correlations are mostly due to the direct edges. Therefore, we would 
expect this Ising model to be $(1, 0, \epsilon)$-loosely connected for some constant
$\epsilon$. In fact, the following theorem gives an explicit characterization of $\epsilon$. The
condition on the girth below is chosen such that there is at most one short 
path between any pair of nodes, so a path is called short if it is shorter than 
half of the girth. 

Theorem 3.5. Assume $(d-1) \tanh J_{\max} < 1$ and the girth

ln[g(lyln2)] 

In^ 

a 

where A= ^{l- e-4'^min)g-8dJmax_ ^ = A^Ae^'^'''^^'' . Then S E, 

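$$\min_{S \subset V \setminus \{i,j\},\ |S| \le D_1}\ \max_{x_i, x_j, x_j', x_S} |P(x_i \mid x_j, x_S) - P(x_i \mid x_j', x_S)| \ge \epsilon,$$

and, for all $(i,j) \notin E$,

$$\min_{S \subset V \setminus \{i,j\},\ |S| \le D_1}\ \max_{x_i, x_j, x_j', x_S} |P(x_i \mid x_j, x_S) - P(x_i \mid x_j', x_S)| \le \frac{\epsilon}{4}.$$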

Proof. See Appendix A. □ 

3.3. Erdős-Rényi Random Graph $\mathcal{G}(p, c/p)$ and Correlation Decay. We
assume the graph $G$ is generated from the prior $\mathcal{G}(p, c/p)$ in which each edge is
in $G$ with probability $c/p$ and the average degree for each node is $c$. For this
random graph, the maximum degree scales as $O(\log p / \log \log p)$ with high probability
[1]. Thus, we cannot use the results for bounded degree graphs even though
the average degree remains bounded as $p \to \infty$.
It is known from prior work [2] that, for ferromagnetic Ising models, i.e.,
$J_{ij} > 0$ for any $i$ and $j$, when $J_{\max}$ is small compared with the average degree
c, the random graph is in the correlation decay regime and the number of 
short paths between any pair of nodes is at most 2 asymptotically. We show 
that the same result holds for general Ising models. Our proof is related 
to the techniques developed in [2], but certain steps in the proof of [2] do 
rely on the fact that the Ising model is ferromagnetic, so the proof does not 
directly carry over. We point out similarities and differences as we proceed 
in Appendix C. 

More specifically, letting $\gamma_p = \frac{\log p}{K \log c}$ for some $K \in (3, 4)$, the following
theorem shows that nodes that are at least $\gamma_p$ hops from each other have
negligible impact on each other. As a consequence of the following theorem,
we can say that a path is short if it is at most $\gamma_p$ hops.

Theorem 3.6. Assume $\alpha = c \tanh J_{\max} < 1$. Then, the following
properties are true almost always.

(1) Let $G$ be a graph generated from the prior $\mathcal{G}(p, c/p)$. If $i, j$ are not
neighbors in $G$ and $S$ separates all the paths shorter than $\gamma_p$ hops between $i, j$,
then $\forall x_i, x_j, x_j', x_S$,

$$|P(x_i \mid x_j, x_S) - P(x_i \mid x_j', x_S)| \le |B(i, \gamma_p)| (\tanh J_{\max})^{\gamma_p} = o(p^{-\kappa}),$$

for all Ising models $P$ on $G$, where $\kappa$ is a positive constant and $B(i, \gamma_p)$ is
the set of all nodes which are at most $\gamma_p$ hops away from $i$.

(2) There are at most two paths shorter than $\gamma_p$ between any pair of nodes.

Proof. See Appendix C. □ 

The above result suggests that for Ising models on the random graph there 
are at most two short paths between non-neighbor nodes and one short non- 
direct path between neighboring nodes, i.e., it is a $(2, 1, \epsilon)$-loosely connected
MRF. Further, the next two theorems prove that such a constant $\epsilon$ exists.
The proofs are in Appendix C. 

Theorem 3.7. For any $(i,j) \notin E$, let $S$ be a set separating the paths
shorter than $\gamma_p$ between $i, j$ and assume $|S| \le 3$, then almost always

$$I(X_i; X_j \mid X_S) = o(p^{-2\kappa}).$$

□

Theorem 3.8. For any $(i,j) \in E$, let $T$ be a set separating the non-direct
paths shorter than $\gamma_p$ between $i, j$ and assume $|T| \le 3$, then almost
always

$$I(X_i; X_j \mid X_T) = \Omega(1).$$

□

4. Our Algorithm and Concentration results. Learning the struc- 
ture of a graph is equivalent to learning if there exists an edge between 
every pair of nodes in the graph. Therefore, we would like to develop a 
test to determine if there exists an edge between two nodes or not. From 
Definition 3.2, it should be clear that learning a loosely connected MRF is 
straightforward. For non-neighbor nodes, we search for the set S that sep- 
arates all the short paths between them, while for neighboring nodes, we 
search for the set T that separates all the non-direct short paths between 
them. As the MRF is loosely connected, the sizes of the above sets are small,
and therefore the complexity of the algorithm is low.

Given $n$ i.i.d. samples $\{X^{(i)}\}_{i=1}^{n}$ from the distribution, the empirical distribution $\hat{P}$ is defined as follows. For any set $A$,

$$\hat{P}(x_A) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}\{X_A^{(i)} = x_A\}.$$
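As a small illustration of the empirical distribution, the following sketch counts how often each configuration of a set of variables appears in the samples. The function name `empirical_marginal` and the representation of samples as tuples are assumptions made only for this example.

```python
from collections import Counter

def empirical_marginal(samples, nodes):
    """Empirical distribution of the variables indexed by `nodes`.

    `samples` is a list of equal-length tuples, one tuple per sample.
    The result maps each observed configuration x_A to its relative frequency.
    """
    n = len(samples)
    counts = Counter(tuple(x[v] for v in nodes) for x in samples)
    return {config: count / n for config, count in counts.items()}

# Four samples of three +/-1 variables; estimate the law of (X_0, X_1).
samples = [(1, 1, -1), (1, -1, -1), (1, 1, 1), (-1, -1, 1)]
p_hat = empirical_marginal(samples, nodes=(0, 1))
```

Configurations that never occur are simply absent from the dictionary, which corresponds to an empirical probability of zero.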

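Let $\hat{\Delta}$ be the empirical conditional independence test which is the same as $\Delta$ but computed using $\hat{P}$. Our first algorithm is as follows.

Algorithm 1 CondST($D_1$, $D_2$, $\varepsilon$)
for $i,j \in V$ do
    if $\exists S$ with $|S| \le D_1$, $\forall T$ with $|T| \le D_2$, $\hat{\Delta}(X_i;X_j|X_S,X_T) \le \frac{\varepsilon}{2}$ then
        $(i,j) \notin E$
    else
        $(i,j) \in E$
    end if
end for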
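The max-min search in Algorithm 1 can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the conditional independence statistic is passed in as a caller-supplied function `test` (a hypothetical name), and the threshold is an explicit parameter.

```python
from itertools import combinations

def subsets_up_to(nodes, max_size):
    """Yield every subset of `nodes` with at most `max_size` elements."""
    for size in range(max_size + 1):
        yield from combinations(nodes, size)

def cond_st(num_nodes, d1, d2, threshold, test):
    """Return the estimated edge set as pairs (i, j) with i < j.

    `test(i, j, cond)` is assumed to return the empirical conditional
    independence statistic of X_i and X_j given the variables in `cond`.
    """
    edges = set()
    for i, j in combinations(range(num_nodes), 2):
        others = [v for v in range(num_nodes) if v not in (i, j)]
        # Non-edge: some S keeps the statistic small for every T.
        is_non_edge = any(
            all(test(i, j, tuple(sorted(set(s) | set(t)))) <= threshold
                for t in subsets_up_to(others, d2))
            for s in subsets_up_to(others, d1)
        )
        if not is_non_edge:
            edges.add((i, j))
    return edges

# Toy oracle for the path graph 0 - 1 - 2: node 1 separates nodes 0 and 2.
def toy_test(i, j, cond):
    if abs(i - j) == 1:
        return 1.0
    return 0.0 if 1 in cond else 0.5

result = cond_st(3, d1=1, d2=0, threshold=0.25, test=toy_test)
```

With the toy oracle, the pair (0, 2) is declared a non-edge only once the separator {1} is tried, which is the role of the minimization over S.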



For clarity, when we specifically use the mutual information test (or the probability test), we denote the corresponding algorithm by $CondST_I$ (or $CondST_P$). When the empirical conditional independence test $\hat{\Delta}$ is close to the exact test $\Delta$, we immediately get the following theorem.

Theorem 4.1. For a $(D_1,D_2,\varepsilon)$-loosely connected MRF, if

$$|\hat{\Delta}(X_i;X_j|X_A) - \Delta(X_i;X_j|X_A)| < \frac{\varepsilon}{4}$$

for any nodes $i,j$ and set $A$ with $|A| \le D_1 + D_2$, then CondST($D_1$, $D_2$, $\varepsilon$) recovers the graph correctly. The running time for the algorithm is $O(np^{D_1+D_2+2})$.

Proof. The correctness is immediate. We note that, for each pair of $i,j$ in $V$, we search $S,T$ in $V$. So the number of possible combinations of $(i,j,S,T)$ is $O(p^{D_1+D_2+2})$ and we get the running time result. □
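The counting argument in the proof can be checked numerically. The sketch below counts the tuples $(i,j,S,T)$ exactly for unordered pairs and subsets of the remaining nodes; the function name `num_tests` is hypothetical.

```python
from math import comb

def num_tests(p, d1, d2):
    """Number of (i, j, S, T) combinations with |S| <= d1 and |T| <= d2."""
    pairs = comb(p, 2)
    s_choices = sum(comb(p - 2, k) for k in range(d1 + 1))
    t_choices = sum(comb(p - 2, k) for k in range(d2 + 1))
    return pairs * s_choices * t_choices

# For p = 10, D1 = 1, D2 = 0: 45 pairs times 9 choices of S times 1 choice of T.
count = num_tests(10, 1, 0)
```

The exact count stays below $p^{D_1+D_2+2}$, and each test touches all $n$ samples, which gives the stated running time.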

When the MRF has correlation decay, it is possible to reduce the compu- 
tational complexity by restricting the search space for the set S and T to a 
smaller candidate neighbor set. In fact, for each node i, the nodes which are 
a certain distance away from i have small correlation with Xi. As suggested 
in [6], we can first perform a pairwise correlation test to eliminate these 
nodes from the candidate neighbor set of node i. To make sure the true 
neighbors are all included in the candidate set, the MRF needs to satisfy 
an additional pairwise non-degeneracy condition. Our second algorithm is 
as follows. 

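Algorithm 2 CondST_Pre($D_1$, $D_2$, $\varepsilon$, $\varepsilon'$)
for $i \in V$ do
    $L_i = \{ j \in V \setminus i : \max_{x_i,x_j,x'_j} |\hat{P}(x_i|x_j) - \hat{P}(x_i|x'_j)| > \frac{\varepsilon'}{2} \}$
    for $j \in L_i$ do
        if $\exists S \subset L_i$ with $|S| \le D_1$, $\forall T \subset L_i$ with $|T| \le D_2$, $\hat{\Delta}(X_i;X_j|X_S,X_T) \le \frac{\varepsilon}{2}$ then
            $(i,j) \notin E$
        else
            $(i,j) \in E$
        end if
    end for
end for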
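The preprocessing step of Algorithm 2 can be sketched as a pairwise correlation test on the samples. This is an illustrative sketch only: the helper names and the explicit `threshold` argument are assumptions, and the variables are taken to be binary with values in {-1, +1}.

```python
def conditional_prob(samples, i, xi, j, xj):
    """Empirical P(X_i = xi | X_j = xj); None if X_j = xj never occurs."""
    matching = [x for x in samples if x[j] == xj]
    if not matching:
        return None
    return sum(1 for x in matching if x[i] == xi) / len(matching)

def candidate_neighbors(samples, i, threshold, values=(-1, 1)):
    """Nodes j whose value visibly shifts the empirical law of X_i."""
    num_nodes = len(samples[0])
    candidates = []
    for j in range(num_nodes):
        if j == i:
            continue
        gap = 0.0
        for xi in values:
            probs = [conditional_prob(samples, i, xi, j, xj) for xj in values]
            probs = [q for q in probs if q is not None]
            if len(probs) >= 2:
                gap = max(gap, max(probs) - min(probs))
        if gap > threshold:
            candidates.append(j)
    return candidates

# X_0 always equals X_1, while X_2 is independent of both.
samples = [(1, 1, 1), (1, 1, -1), (-1, -1, 1), (-1, -1, -1)]
neighbors_of_0 = candidate_neighbors(samples, 0, threshold=0.25)
```

Only the candidate set returned here would then be searched for the separators S and T, which is where the reduction in running time comes from.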



The following theorem provides conditions under which the second algo- 
rithm correctly learns the MRF. 

Theorem 4.2. For a $(D_1, D_2, \varepsilon)$-loosely connected MRF with

(1) $\max_{x_i,x_j,x'_j} |P(x_i|x_j) - P(x_i|x'_j)| > \varepsilon'$

for any $(i,j) \in E$, if

$$|\hat{P}(x_i|x_j) - P(x_i|x_j)| < \frac{\varepsilon'}{4}$$

for any nodes $i,j$ and $x_i,x_j$, and

$$|\hat{\Delta}(X_i;X_j|X_A) - \Delta(X_i;X_j|X_A)| < \frac{\varepsilon}{4}$$

for any nodes $i,j$ and set $A$ with $|A| \le D_1 + D_2$, then CondST_Pre($D_1$, $D_2$, $\varepsilon$, $\varepsilon'$) recovers the graph correctly. Let $L = \max_i |L_i|$. The running time for the algorithm is $O(np^2 + npL^{D_1+D_2+1})$.

Proof. By the pairwise non-degeneracy condition (1), the neighbors of node $i$ are all included in the candidate neighbor set $L_i$. We note that this preprocessing step excludes the nodes whose correlation with node $i$ is below $\frac{\varepsilon'}{2}$. Then in the inner loop, the correctness of the algorithm is immediate. The running time of the correlation test is $O(np^2)$. We note that, for each $i$ in $V$, we loop over $j$ in $L_i$ and search $S$ and $T$ in $L_i$. So the number of possible combinations of $(i,j,S,T)$ is $O(pL^{D_1+D_2+1})$. Combining the two steps, we get the running time of the algorithm. □

Note that the additional non-degeneracy condition (1) required for the 
second algorithm to execute correctly is not satisfied for all graphs (recall 
Example 1.1). 

4.1. Concentration Results. In this subsection, we show a set of concen- 
tration results for the empirical quantities in the above algorithm for general 
discrete MRFs, which will be used to obtain the sample complexity results 
in Section 5 and Section 6. 

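Lemma 4.3. Fix $\gamma > 0$. Let $L = \max_i |L_i|$. For $\forall \alpha > 0$,

1. Assume $\gamma \le \frac{1}{4}$. If

$$n > \frac{2[(2+\alpha)\log p + 2\log|\mathcal{X}|]}{\gamma^2},$$

then $\forall i,j \in V$, $\forall x_i, x_j$,

$$|\hat{P}(x_i|x_j) - P(x_i|x_j)| < 4\gamma$$

with probability $1 - \frac{c_1}{p^{\alpha}}$ for some constant $c_1$.

2. Assume $\forall S \subset V$, $|S| \le D_1 + D_2 + 1$, $P(x_S) > \delta$ for some constant $\delta$, and $\gamma \le \frac{\delta}{2}$. If

$$n > \frac{2[(1+\alpha)\log p + (D_1+D_2+1)\log L + (D_1+D_2+2)\log|\mathcal{X}|]}{\gamma^2},$$

then $\forall i \in V$, $\forall j \in L_i$, $\forall S \subset L_i$, $|S| \le D_1 + D_2$, $\forall x_i, x_j, x_S$,

$$|\hat{P}(x_i|x_j,x_S) - P(x_i|x_j,x_S)| < \frac{2\gamma}{\delta}$$

with probability $1 - \frac{c_2}{p^{\alpha}}$ for some constant $c_2$.

3. Assume $\gamma \le \frac{1}{2|\mathcal{X}|^{D_1+D_2+2}} < 1$. If

$$n > \frac{2[(1+\alpha)\log p + (D_1+D_2+1)\log L + (D_1+D_2+2)\log|\mathcal{X}|]}{\gamma^2},$$

then $\forall i \in V$, $\forall j \in L_i$, $\forall S \subset L_i$, $|S| \le D_1 + D_2$,

$$|\hat{I}(X_i;X_j|X_S) - I(X_i;X_j|X_S)| < 8|\mathcal{X}|^{D_1+D_2+2}\sqrt{\gamma}$$

with probability $1 - \frac{c_3}{p^{\alpha}}$ for some constant $c_3$.

Proof. See Appendix D. □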

Proof. See Appendix D. □ 

This lemma could be used as a guideline on how to choose between the two conditional independence tests for our algorithm to get lower sample complexity. The key difference is the dependence on the constant $\delta$, which is a lower bound on the probability of any $x_S$ with the set size $|S| \le D_1 + D_2 + 1$. The probability test requires a constant $\delta > 0$ to achieve sample complexity $n = O(\log p)$, while the mutual information test does not depend on $\delta$ and also achieves sample complexity $n = O(\log p)$. We note that, while both tests have $O(\log p)$ sample complexity, the constants hidden in the order notation may be different for the two tests. For Ising models on bounded degree graphs, we show in the next section that a constant $\delta > 0$ exists, and the probability test gives a lower sample complexity. On the other hand, for Ising models on the Erdős-Rényi random graph $\mathcal{G}(p, \frac{c}{p})$, we could not get a constant $\delta > 0$ as the maximum degree of the graph is unbounded, and the mutual information test gives a lower sample complexity.

5. Computational Complexity for General Ising Models. In this 
section, we apply our algorithm to the Ising models in Section 3. We eval- 
uate both the number of samples required to recover the graph with high 
probability and the running time of our algorithm. The results below are 
simple combinations of the results in the previous two sections. Unless otherwise stated, we assume that the edge coefficients $J_{ij}$ are bounded, i.e., $J_{\min} \le |J_{ij}| \le J_{\max}$. Throughout this section, we use the notation $x \wedge y$ to denote the minimum of $x$ and $y$.

5.1. Bounded Degree Graph. We assume the graph has maximum degree $d$. First we have the following lower bound on the probability of any finite size set of variables.

Lemma 5.1. $\forall S \subset V$, $\forall x_S$, $P(x_S) \ge 2^{-|S|} \exp(-2(d+1)|S|^2 J_{\max})$.
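The bound in Lemma 5.1 can be sanity-checked by brute force on a small model. The sketch below assumes a zero-field Ising model of the form $P(x) \propto \exp(\sum_{(a,b) \in E} J_{ab} x_a x_b)$ on a 4-cycle; the parametrization and the function names are assumptions for this illustration and are not taken from the paper.

```python
from itertools import product
from math import exp

def ising_marginal(num_nodes, couplings, subset):
    """Exact marginal of the nodes in `subset` by enumerating all states."""
    weights = {}
    for x in product((-1, 1), repeat=num_nodes):
        w = exp(sum(J * x[a] * x[b] for (a, b), J in couplings.items()))
        key = tuple(x[v] for v in subset)
        weights[key] = weights.get(key, 0.0) + w
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}

def lemma_bound(size, d, j_max):
    """Right-hand side of Lemma 5.1 for a set of the given size."""
    return 2.0 ** (-size) * exp(-2 * (d + 1) * size ** 2 * j_max)

# 4-cycle with mixed-sign couplings, maximum degree 2, J_max = 0.5.
couplings = {(0, 1): 0.5, (1, 2): -0.5, (2, 3): 0.5, (3, 0): 0.5}
smallest = {
    subset: min(ising_marginal(4, couplings, subset).values())
    for subset in [(0,), (0, 1), (0, 2), (0, 1, 2)]
}
```

On this example the smallest marginal probability of each subset sits well above the bound, which is expected since the bound is a worst-case estimate.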

Our algorithm with the probability test for the bounded degree graph 
case reproduces the algorithm in [6]. For completeness, we state the theorem 
below without a proof since it is nearly identical to the result in [6], except 
for some constants. 

Theorem 5.2. Let $\varepsilon$ be defined as in Proposition 3.3. Define

$$\delta = 2^{-2d+1} \exp(-2(d+1)(2d-1)^2 J_{\max}).$$

Letj = fAl<l.Ifn> ^^'^^^^^-^^J , the algorUhm 

$CondST_P(d, d-1, \varepsilon)$ recovers $G$ with probability $1 - \frac{c}{p^{\alpha}}$ for some constant $c$. The running time of the algorithm is $O(np^{2d+1})$. □

5.2. Bounded Degree Graph, Correlation Decay and Large Girth. We as- 
sume the graph has maximum degree d. We also assume that the Ising model 
is in the correlation decay regime, i.e., {d — 1) tanh Jmax < 1, and the graph 
has large girth. Combining Theorem 3.5, Theorem 4.1 and Lemma 4.3, we can show that the algorithm $CondST_P(1,0,\varepsilon)$ recovers the graph correctly with high probability for some constant $\varepsilon$, and the running time is $O(np^3)$ for $n = O(\log p)$.

We can get even lower computational complexity using our second algo- 
rithm. The key observation is that, as there is no short path other than the 
direct edge between neighboring nodes, the correlation over the edge dom- 
inates the total correlation hence the pairwise non-degeneracy condition is 
satisfied. We note that the length of the second shortest path between neigh- 
boring nodes is no less than g — 1. 

Lemma 5.3. Assume that $(d-1) \tanh J_{\max} < 1$, and the girth $g$ satisfies

In r/3(i Vln2)l 
m - 

a 

where $A = \frac{1}{1800}(1 - e^{-4J_{\min}})$. Let $\varepsilon' = 48A$. $\forall (i,j) \in E$, we have

$$\max_{x_i,x_j,x'_j} |P(x_i|x_j) - P(x_i|x'_j)| > \varepsilon'.$$

Proof. See Appendix A. □

Using this lemma, we can apply our second algorithm to learn the graph. Using Lemma 3.4, if node $j$ is at distance $l_{\varepsilon'}$ hops from node $i$, where $l_{\varepsilon'}$ is a constant depending on $\varepsilon'$, we have

$$\max_{x_i,x_j,x'_j} |P(x_i|x_j) - P(x_i|x'_j)| \le \beta\alpha^{l_{\varepsilon'}},$$

which falls below the threshold used in the correlation test. Therefore, in the correlation test, $L_i$ only includes nodes within distance $l_{\varepsilon'}$ from $i$ and the size $|L_i| \le d^{l_{\varepsilon'}}$ since the maximum degree is $d$; i.e., $L = \max_i |L_i| \le d^{l_{\varepsilon'}}$, which is a constant independent of $p$. Combining the previous lemma, Theorem 3.5, Theorem 4.2 and Lemma 4.3, we get the following result.

Theorem 5.4. Assume $(d-1) \tanh J_{\max} < 1$. Assume $g$, $\varepsilon$ and $\varepsilon'$ satisfy Theorem 3.5 and Lemma 5.3. Let $\delta$ be defined as in Theorem 5.2. Let $\gamma =$

— A — A - If 

$$n > \frac{2[(2+\alpha)\log p + 2 l_{\varepsilon'} \log d + 3\log 2]}{\gamma^2},$$

the algorithm $CondST\_Pre_P(1, 0, \varepsilon, \varepsilon')$ recovers $G$ with probability $1 - \frac{c}{p^{\alpha}}$ for some constant $c$. The running time of the algorithm is $O(np^2)$. □

5.3. Erdős-Rényi Random Graph $\mathcal{G}(p,\frac{c}{p})$ and Correlation Decay. We assume the graph $G$ is generated from the prior $\mathcal{G}(p,\frac{c}{p})$ in which each edge is in $G$ with probability $\frac{c}{p}$ and the average degree for each node is $c$. Because the random graph has unbounded maximum degree, we cannot lower bound the probability of a finite size set of random variables by a constant, for all $p$. To get good sample complexity, we use the mutual information test in our algorithm. Combining Theorem 3.7, Theorem 3.8, Theorem 4.1 and Lemma 4.3, we get the following result.

Theorem 5.5. Assume $c \tanh J_{\max} < 1$. There exists a constant $\varepsilon > 0$

such that, for 7 = (gfy)^ A < 1, i/n > ^^^^ ^ ^2^^ — ihe algorithm 
$CondST_I(2,1,\varepsilon)$ recovers the graph $G$ almost always. The running time of the algorithm is $O(np^5)$. □
6. Computational Complexity for Ferromagnetic Ising Models.

Ferromagnetic Ising models are Ising models in which all the edge coefficients $J_{ij}$ are nonnegative. We say $(i,j)$ is an edge if $J_{ij} > 0$. One important property of ferromagnetic Ising models is association, which characterizes the positive dependence among the nodes.

Definition 6.1. [8] We say a collection of random variables $X = (X_1, X_2, \ldots, X_n)$ is associated, or the random vector $X$ is associated, if

$$\mathrm{Cov}(f(X), g(X)) \ge 0$$

for all nondecreasing functions $f$ and $g$ for which $E f(X)$, $E g(X)$, $E f(X)g(X)$ exist. □

Proposition 6.2. [11] The random vector $X$ of a ferromagnetic Ising model (possibly with external fields) is associated. □

A useful consequence of the Ising model being associated is as follows.

Corollary 6.3. Assume $X$ is a zero field ferromagnetic Ising model. For any $i,j$, $P(X_i = 1, X_j = 1) \ge \frac{1}{4} \ge P(X_i = 1, X_j = -1)$.

Proof. See Appendix B. □
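Corollary 6.3 can be checked by exact enumeration on a small ferromagnetic model. The sketch assumes a zero-field model of the form $P(x) \propto \exp(\sum_{(a,b) \in E} J_{ab} x_a x_b)$ with positive couplings; the parametrization, the example graph and the function name are assumptions for illustration.

```python
from itertools import combinations, product
from math import exp

def pair_marginal(num_nodes, couplings, i, j):
    """Exact joint law of (X_i, X_j) in a zero-field Ising model."""
    weights = {}
    for x in product((-1, 1), repeat=num_nodes):
        w = exp(sum(J * x[a] * x[b] for (a, b), J in couplings.items()))
        weights[(x[i], x[j])] = weights.get((x[i], x[j]), 0.0) + w
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}

# A triangle 0-1-2 with a pendant node 3; all couplings are positive.
couplings = {(0, 1): 0.4, (1, 2): 0.6, (0, 2): 0.5, (2, 3): 0.4}
checks = {
    (i, j): pair_marginal(4, couplings, i, j)
    for i, j in combinations(range(4), 2)
}
```

Every pair, whether adjacent or not, has agreement probability at least 1/4 and disagreement probability at most 1/4 in this example.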

Informally speaking, the edge coefficient $J_{ij} > 0$ means that $i$ and $j$ are positively dependent over the edge. For any path between $i,j$, as all the edge coefficients are positive, the dependence over the path is also positive. Therefore, the non-direct paths between a pair of neighboring nodes $i,j$ make $X_i$ and $X_j$, which are positively dependent over the edge $(i,j)$, even more positively dependent. This observation has two important implications for our algorithm.

1. We do not need to break the short cycles with a set T in order to detect 
the edges, so the maximization in the algorithm can be removed. 

2. The pairwise non-degeneracy is always satisfied for some constant e', 
so we can apply the correlation test to reduce the computational com- 
plexity. 

6.1. Bounded Degree Graph. We assume the graph has maximum degree $d$. We have the following non-degeneracy result for ferromagnetic Ising models.

Lemma 6.4. $\forall (i,j) \in E$, $S \subset V \setminus \{i,j\}$ and $\forall x_S$,

$$\max_{x_i,x_j,x'_j} |P(x_i|x_j,x_S) - P(x_i|x'_j,x_S)| \ge \frac{1}{16}(1 - e^{-4J_{\min}}) e^{-4|N_S|J_{\max}}.$$

Proof. See Appendix A. □

The following theorem justifies the remarks after Corollary 6.3 and shows that the algorithm with the preprocessing step CondST_Pre($d$, 0, $\varepsilon$, $\varepsilon'$) can be used to learn the graph, where $\varepsilon, \varepsilon'$ are obtained from the above lemma. Recall that $L_i$ is the candidate neighbor set of node $i$ after the preprocessing step and $L = \max_i |L_i|$.

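Theorem 6.5. Let

$$\varepsilon = \frac{1}{16}(1 - e^{-4J_{\min}}) e^{-4d^2 J_{\max}}, \quad \varepsilon' = \frac{1}{16}(1 - e^{-4J_{\min}}),$$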

and S be defined as in Theorem 5.2. Let 7=|2'^ff'^|- ^/ 

$$n > \frac{2[(1+\alpha)\log p + (d+1)\log L + (d+2)\log 2]}{\gamma^2},$$

the algorithm $CondST\_Pre_P(d, 0, \varepsilon, \varepsilon')$ recovers $G$ with probability $1 - \frac{c}{p^{\alpha}}$ for some constant $c$. The running time of the algorithm is $O(np^2 + npL^{d+1})$. If we further assume that $(d-1)\tanh J_{\max} < 1$, then the running time of the algorithm is $O(np^2)$.

Proof. We choose $|S| \le d$ and $T = \emptyset$ in our algorithm, and we have $|N_S| \le d^2$ as the maximum degree is $d$. By Lemma 6.4, we have

$$\max_{x_i,x_j,x'_j} |P(x_i|x_j,x_S) - P(x_i|x'_j,x_S)| \ge \varepsilon$$

for any $|S| \le d$. Therefore, the Ising model is a $(d, 0, \varepsilon)$-loosely connected MRF. Note that Lemma 6.4 is applicable to any set $S$ (not necessarily the set $S$ in the conditional independence test). Applying Lemma 6.4 again with $S = \emptyset$, we get the pairwise non-degeneracy condition

$$\max_{x_i,x_j,x'_j} |P(x_i|x_j) - P(x_i|x'_j)| \ge \varepsilon'.$$

Combining Theorem 4.2 and Lemma 4.3, we get the correctness of the algorithm. The running time is $O(np^2 + npL^{d+1})$, which is at most $O(np^{d+2})$.

When $(d-1)\tanh J_{\max} < 1$, as the Ising model is in the correlation decay regime, $L = \max_i |L_i| \le d^{l_{\varepsilon'}}$ is a constant independent of $p$ as argued for Theorem 5.4. Therefore, the running time is only $O(np^2)$ in this case. □
6.2. Erdős-Rényi Random Graph $\mathcal{G}(p, \frac{c}{p})$ and Correlation Decay. When the Ising model is ferromagnetic, the result for the random graph is similar to that of a deterministic graph. For each graph sampled from the prior distribution, the dependence over the edges is positive. If $i,j$ are neighbors in the graph, having additional paths between them makes them more positively dependent, so we do not need to block those paths with a set $T$ to detect the edge and set $D_2 = 0$. In fact, we can prove a stronger result for neighbor nodes than the general case. The following result also appears in [2], but we are unable to verify the correctness of all the steps there and so we present the result here for completeness.

Theorem 6.6. $\forall i \in V$, $\forall j \in N_i$, let $S$ be any set with $|S| \le 2$, then almost always

$$I(X_i;X_j|X_S) = \Omega(1).$$

Proof. See Appendix C. □ 

Moreover, the pairwise non-degeneracy condition in Theorem 6.5 also holds here. We can thus use algorithm CondST_Pre(2, 0, $\varepsilon$, $\varepsilon'$) to learn the graph. Without the pre-processing step, our algorithm is the same as in [2]. We show in the following theorem that using the pre-processing step our algorithm achieves lower computational complexity in the order of $p$.

Theorem 6.7. Assume $c \tanh J_{\max} < 1$ and the Ising model is ferromagnetic. Let $\varepsilon'$ be defined as in Theorem 6.5. There exists a constant $\varepsilon > 0$

such that, for 7 = f| A (^)^ ^ W2 ^ ^' ^Z"- > — ^^"1^2 ^ "'"^ — the 
algorithm $CondST\_Pre_I(2,0,\varepsilon,\varepsilon')$ recovers the graph $G$ almost always. The running time of the algorithm is $O(np^2)$.

Proof. Combining Theorem 3.7, Theorem 3.8, Theorem 4.2, Lemma 4.3 and Lemma 6.4, we get the correctness of the algorithm.

From Theorem 3.6 we know that if $j$ is more than $\gamma_p$ hops away from $i$, the correlation between them decays as $o(p^{-\kappa})$. For the constant threshold $\frac{\varepsilon'}{2}$, these far-away nodes are excluded from the candidate neighbor set $L_i$ when $p$ is large. It is shown in the proof of [13, Lemma 2.1] that for $\mathcal{G}(p,\frac{c}{p})$, the number of nodes in the $\gamma_p$-ball around $i$ is not large with high probability. More specifically, $\forall i \in V$, $|B(i,\gamma_p)| = O(c^{\gamma_p} \log p)$ almost always, where $B(i,\gamma_p)$ is the set of all nodes which are at most $\gamma_p$ hops away from $i$. Therefore we get

$$L = \max_i |L_i| \le \max_i |B(i,\gamma_p)| = O(c^{\gamma_p} \log p) = O(p^{\frac{1}{K}} \log p) = O(p^{\frac{1}{3}}).$$

So the total running time of algorithm $CondST\_Pre_I(2, 0, \varepsilon, \varepsilon')$ is $O(np^2 + npL^3) = O(np^2)$. □
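The chain of equalities in this proof can be unpacked step by step. The following worked computation takes $\gamma_p = \frac{\log p}{K \log c}$ with $K \in (3,4)$, which is the form consistent with the identity $c^{\gamma_p} = p^{1/K}$ used above.

```latex
\begin{aligned}
c^{\gamma_p} &= \exp\!\left(\frac{\log p}{K \log c} \cdot \log c\right) = p^{1/K}, \\
L &= O\!\left(p^{1/K} \log p\right) = O\!\left(p^{1/3}\right) \quad \text{since } K > 3, \\
n p L^{3} &= O(n p \cdot p) = O(n p^{2}).
\end{aligned}
```

The exponent 3 on $L$ is $D_1 + D_2 + 1$ with $D_1 = 2$ and $D_2 = 0$, so the search over separators costs no more than the pairwise correlation test.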

7. Experimental Results. In this section, we present experimental results to show the importance of the choice of a non-zero $D_2$ in correctly estimating the edges and non-edges of the underlying graph of an MRF. We evaluate our algorithm $CondST_I(D_1, D_2, \varepsilon)$, which uses the mutual information test and does not have the preprocessing step, for general Ising models on grids and random graphs as illustrated in Figure 1. In a single run of the algorithm, we first generate the graph $G = (V,E)$: for grids, the graph is fixed, while for random graphs, the graph is generated randomly each time. After generating the graph, we generate the edge coefficients uniformly from $[-J_{\max}, -J_{\min}] \cup [J_{\min}, J_{\max}]$, where $J_{\min} = 0.4$ and $J_{\max} = 0.6$. We then generate samples from the Ising model by Gibbs sampling. The sample size ranges from 400 to 1000. The algorithm computes, for each pair of nodes $i$ and $j$,

$$I_{ij} = \min_{|S| \le D_1} \max_{|T| \le D_2} \hat{I}(X_i;X_j|X_S,X_T)$$

using the samples. For a particular threshold $\varepsilon$, the algorithm outputs $(i,j)$ as an edge if $I_{ij} > \varepsilon$ and gets an estimated graph $\hat{G} = (V,\hat{E})$. We select $\varepsilon$ optimally for each run of the simulation, using the knowledge of the graph, such that the number of errors in $\hat{E}$, including both errors in edges and non-edges, is minimized. The performance of the algorithm in each case is evaluated by the probability of success, which is the percentage of the correctly estimated edges, and each point in the plots is an average over 50 runs. We then compare the performance of the algorithm under different choices of $D_1$ and $D_2$.
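The statistic $I_{ij}$ can be computed from samples along the following lines. This is a sketch under simplifying assumptions (plug-in estimate of the conditional mutual information, exhaustive search over subsets, hypothetical function names), not the code used for the experiments.

```python
from collections import Counter
from itertools import combinations
from math import log

def cond_mutual_info(samples, i, j, cond):
    """Plug-in estimate of I(X_i; X_j | X_cond) in nats."""
    n = len(samples)
    c_ijz = Counter((x[i], x[j], tuple(x[v] for v in cond)) for x in samples)
    c_iz = Counter((x[i], tuple(x[v] for v in cond)) for x in samples)
    c_jz = Counter((x[j], tuple(x[v] for v in cond)) for x in samples)
    c_z = Counter(tuple(x[v] for v in cond) for x in samples)
    return sum(
        (c / n) * log(c * c_z[z] / (c_iz[(a, z)] * c_jz[(b, z)]))
        for (a, b, z), c in c_ijz.items()
    )

def subsets_up_to(nodes, max_size):
    for size in range(max_size + 1):
        yield from combinations(nodes, size)

def min_max_statistic(samples, i, j, d1, d2):
    """min over |S| <= d1 of max over |T| <= d2 of the estimated CMI."""
    others = [v for v in range(len(samples[0])) if v not in (i, j)]
    return min(
        max(cond_mutual_info(samples, i, j, tuple(sorted(set(s) | set(t))))
            for t in subsets_up_to(others, d2))
        for s in subsets_up_to(others, d1)
    )

# Chain X_0 - X_1 - X_2: each end copies X_1 with probability 3/4.
samples = []
for x1 in (-1, 1):
    for x0 in (-1, 1):
        for x2 in (-1, 1):
            copies = (3 if x0 == x1 else 1) * (3 if x2 == x1 else 1)
            samples += [(x0, x1, x2)] * copies
stat_edge = min_max_statistic(samples, 0, 1, d1=1, d2=0)
stat_non_edge = min_max_statistic(samples, 0, 2, d1=1, d2=0)
```

In this toy chain the statistic vanishes for the non-edge (0, 2) because conditioning on node 1 separates the two ends, while it stays bounded away from zero for the edge (0, 1).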




Fig 1: Illustrations of four-neighbor grid, eight-neighbor grid and the random 
graph. 



The experimental results for the algorithm with $D_1 = 0, \ldots, 3$ and $D_2 = 0, 1$ applied to eight-neighbor grids on 25 and 36 nodes are shown in Figure 2. We omit the results for four-neighbor grids as the performances of the algorithm with $D_2 = 0$ and $D_2 > 0$ are very close. In fact, four-neighbor grids do not have many short cycles and even the shortest non-direct paths are weak for the relatively small $J_{\max}$ we choose, therefore there is no benefit using a set $T$ to separate the non-direct paths for edge detection. However, for eight-neighbor grids which are denser and have shorter cycles, the probability of success of the algorithm significantly improves by setting $D_2 = 1$, as seen from Figure 2. It is also interesting to note that increasing from $D_1 = 2$ to $D_1 = 3$ does not improve the performance, which implies that a set $S$ of size 2 is sufficient to approximately separate the non-neighbor nodes in our eight-neighbor grids.



[Figure 2 panels: 5x5 eight-neighbor grid; 6x6 eight-neighbor grid. Horizontal axis: sample size.]

Fig 2: Plots of the probability of success versus the sample size for 5x5 and 6x6 eight-neighbor grids with $D_1 = 0, \ldots, 3$ and $D_2 = 0, 1$.



The experimental results for the algorithm with $D_1 = 0, \ldots, 3$ and $D_2 = 0, 1$ applied to random graphs on 20 and 30 nodes are shown in Figure 3. For a random graph on $n$ nodes with average degree $d$, each edge is included in the graph with probability $\frac{d}{n-1}$ and is independent of all other edges. In the experiment, we choose average degree 5 for the graphs on 20 nodes and 7 for the graphs on 30 nodes. From Figure 3, the probability of success of the algorithm improves a lot when we increase $D_2$ from 0 to 1, which is very similar to the result of the eight-neighbor grids. We also note that, unlike the previous case, the algorithm with $D_1 = 3$ does have a better performance than with $D_1 = 2$ as there might be more short paths between a pair of nodes in random graphs.
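The random graph setup can be sketched as follows. Each of the $n-1$ potential edges at a node is included with probability $d/(n-1)$, so the expected degree is $d$. The function names and the use of a seeded `random.Random` instance are choices made for this illustration; the coupling range follows the values $J_{\min} = 0.4$ and $J_{\max} = 0.6$ quoted in the text.

```python
import random

def random_graph(num_nodes, avg_degree, rng):
    """Include each pair independently with probability d / (n - 1)."""
    prob = avg_degree / (num_nodes - 1)
    return [
        (i, j)
        for i in range(num_nodes)
        for j in range(i + 1, num_nodes)
        if rng.random() < prob
    ]

def random_couplings(edges, j_min, j_max, rng):
    """Draw each coefficient uniformly from [-j_max, -j_min] U [j_min, j_max]."""
    return {e: rng.choice((-1, 1)) * rng.uniform(j_min, j_max) for e in edges}

rng = random.Random(0)
edges = random_graph(20, 5, rng)
couplings = random_couplings(edges, 0.4, 0.6, rng)
```

Samples from the resulting Ising model would then be drawn by Gibbs sampling, which is not shown here.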

[Figure 3 panels: Random graph on 20 nodes with average degree 5; Random graph on 30 nodes with average degree 7.]

Fig 3: Plots of the probability of success versus the sample size for random graphs with $D_1 = 0, \ldots, 3$ and $D_2 = 0, 1$.

In a true experiment where only the data is available and no prior knowledge of the MRF is available, the choice of $\varepsilon$ itself may affect the performance of the algorithm. At this time, we do not have any theoretical results to inform the choice of $\varepsilon$. We briefly present a heuristic, which seems reasonable. However, extensive testing of the heuristic is required before we can confidently state that the heuristic is reasonable, which is beyond the scope of this paper. Our proposed heuristic is as follows.

For a given $D_1$ and $D_2$, we compute $I_{ij}$ for each pair of nodes $i$ and $j$. If the choice of $D_1$ and $D_2$ is good, $I_{ij}$ is expected to be close to 0 for non-edges and away from 0 for edges. Therefore, we can view the problem of choosing the threshold $\varepsilon$ as a two-class hypothesis testing problem, where the non-edge class concentrates near 0 while the edge class is more spread out. If we view $I$, the collection of $I_{ij}$ for all $i$ and $j$, as samples generated from the distribution of some random variable $Z$, then the hypothesis testing problem can be viewed as one of finding the right $\varepsilon$ such that the density of $Z$ has a big spike below $\varepsilon$. One heuristic is to first estimate a smoothed density function from $I$ via kernel density estimation [9] and then set $\varepsilon$ to be the right boundary of the big spike near 0.
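One possible reading of this heuristic is sketched below: estimate the density of the $I_{ij}$ values with a Gaussian kernel, climb to the top of the spike nearest 0, and take the first local minimum to its right as $\varepsilon$. The bandwidth, the grid resolution and the interpretation of "right boundary" as the first local minimum are all assumptions of this sketch and are not specified in the text.

```python
from math import exp, pi, sqrt

def gaussian_kde(values, bandwidth):
    """Return a function evaluating a Gaussian kernel density estimate."""
    norm = 1.0 / (len(values) * bandwidth * sqrt(2 * pi))
    def density(z):
        return norm * sum(exp(-0.5 * ((z - v) / bandwidth) ** 2) for v in values)
    return density

def spike_boundary(values, bandwidth, grid_size=200):
    """First local minimum of the estimated density right of the spike near 0."""
    density = gaussian_kde(values, bandwidth)
    top = max(values)
    grid = [top * k / (grid_size - 1) for k in range(grid_size)]
    dens = [density(z) for z in grid]
    k = 0
    while k + 1 < grid_size and dens[k + 1] > dens[k]:
        k += 1  # climb to the top of the spike near 0
    while k + 1 < grid_size and dens[k + 1] < dens[k]:
        k += 1  # descend to the first local minimum
    return grid[k]

# Twenty statistics near 0 (non-edges) and five spread-out ones (edges).
stats = [0.001 * k for k in range(20)] + [0.30, 0.35, 0.40, 0.45, 0.50]
epsilon = spike_boundary(stats, bandwidth=0.02)
edges_found = [s for s in stats if s > epsilon]
```

The chosen threshold depends on the bandwidth, so in practice the sensitivity to this parameter would need to be examined.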

In order to choose proper $D_1$ and $D_2$ for the algorithm, we can start with $(D_1, D_2) = (0,0)$. At each step, we run the algorithm with two pairs of values $(D_1 + 1, D_2)$ and $(D_1, D_2 + 1)$ separately, and choose the pair that has a more significant change on the density estimated from $I$ as the new value for $(D_1, D_2)$. We continue this process and stop increasing $D_1$ or $D_2$ if at some step there is no significant change for either pair of values.

Justifying this heuristic either through extensive experimentation or the- 
oretical analysis is a topic for future research. 
APPENDIX A: BOUNDED DEGREE GRAPH

A.1. Proof of Lemma 5.1. Let $N_S$ be the neighbor nodes of $S$.

$$\begin{aligned}
P(x_S) &= \sum_{x_{N_S}} P(x_{N_S}) P(x_S|x_{N_S}) \\
&\ge \min_{x_S, x_{N_S}} P(x_S|x_{N_S}) \\
&= \min_{x_S, x_{N_S}} \frac{\exp(x_S^T J_{SS} x_S + x_S^T J_{SN_S} x_{N_S})}{\sum_{x'_S} \exp(x'^T_S J_{SS} x'_S + x'^T_S J_{SN_S} x_{N_S})} \\
&\ge \frac{\min_{x_S, x_{N_S}} \exp(x_S^T J_{SS} x_S + x_S^T J_{SN_S} x_{N_S})}{2^{|S|} \max_{x'_S, x_{N_S}} \exp(x'^T_S J_{SS} x'_S + x'^T_S J_{SN_S} x_{N_S})} \\
&\ge \frac{\exp(-(|S|^2 J_{\max} + |S||N_S| J_{\max}))}{2^{|S|} \exp(|S|^2 J_{\max} + |S||N_S| J_{\max})} \\
&= 2^{-|S|} \exp(-2(|S|^2 J_{\max} + |S||N_S| J_{\max})) \\
&\ge 2^{-|S|} \exp(-2(d+1)|S|^2 J_{\max}).
\end{aligned}$$

A.2. Correlation Decay and Large Girth. We assume that the Ising model on the bounded degree graph is further in the correlation decay regime. The following lemma characterizes the conditions under which the Ising model is $(D_1, D_2, \varepsilon)$-loosely connected.

Lemma A.1. Assume $(d-1)\tanh J_{\max} < 1$. Fix $D_1, D_2$. Let

$$h = \frac{\ln\left[\beta\left(\frac{1}{A} \vee \frac{1}{\ln 2}\right)\right]}{\ln \frac{1}{\alpha}},$$

where $A = \frac{1}{1800}(1 - e^{-4J_{\min}}) e^{-8(D_1+D_2)dJ_{\max}}$, and let $\varepsilon = 48 A e^{4(D_1+D_2)dJ_{\max}}$. Assume that there are at most $D_1$ paths shorter than $h$ between non-neighbor nodes and $D_2$ paths shorter than $h$ between neighboring nodes. Then $\forall (i,j) \in E$,

$$\min_{\substack{S \subset V \setminus \{i \cup j\} \\ |S| \le D_1}} \; \max_{\substack{T \subset V \setminus \{i \cup j\} \\ |T| \le D_2}} \; \max_{x_i,x_j,x'_j,x_S,x_T} |P(x_i|x_j,x_S,x_T) - P(x_i|x'_j,x_S,x_T)| > \varepsilon,$$

and $\forall (i,j) \notin E$,

$$\min_{\substack{S \subset V \setminus \{i \cup j\} \\ |S| \le D_1}} \; \max_{\substack{T \subset V \setminus \{i \cup j\} \\ |T| \le D_2}} \; \max_{x_i,x_j,x'_j,x_S,x_T} |P(x_i|x_j,x_S,x_T) - P(x_i|x'_j,x_S,x_T)| \le \frac{\varepsilon}{4}.$$

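Proof. First consider $(i,j) \in E$. Without loss of generality, assume $J_{ij} > 0$. By the assumption that there are at most $D_2$ paths shorter than $h$ between neighboring nodes, there exists $T' \subset N_i$, $|T'| \le D_2$ such that, when the set $T'$ is removed from the graph, the length of any path from $i$ to $j$ is no less than $h$. For any $S$, let $T = T' \setminus S$. To simplify the notation, let $R = S \cup T$ and $W = V \setminus R$. For any value $x_R$, let $Q$ be the joint probability of $X_W$ conditioned on $X_R = x_R$, i.e., $Q(X_W) = P(X_W|x_R)$. $Q$ has the same edge coefficients for the unconditioned nodes, but is not zero-field as conditioning induces external fields. Let $\tilde{Q}$ denote the joint probability when edge $(i,j)$ is removed from $Q$. We note that $Q$ and $\tilde{Q}$ satisfy the same correlation decay property as $P$, so

$$\begin{aligned}
\tilde{Q}(1,1) &= \tilde{Q}(X_i = 1)\tilde{Q}(X_j = 1|X_i = 1) \\
&\ge \tilde{Q}(X_i = 1)[\tilde{Q}(X_j = 1|X_i = -1) - \beta\alpha^{d(i,j)}] \\
&\ge \tilde{Q}(X_i = 1)[\tilde{Q}(X_j = 1|X_i = -1) - \beta\alpha^{h}],
\end{aligned}$$

where $d(i,j) \ge h$ is the distance between $i$ and $j$ once the edge $(i,j)$ and the set $R$ are removed. Similarly, $\tilde{Q}(-1,-1) \ge \tilde{Q}(X_i = -1)[\tilde{Q}(X_j = -1|X_i = 1) - \beta\alpha^{h}]$. Then,

$$\begin{aligned}
\tilde{Q}(1,1)\tilde{Q}(-1,-1) &\ge \tilde{Q}(X_i = 1)\tilde{Q}(X_i = -1)[\tilde{Q}(X_j = 1|X_i = -1) - \beta\alpha^{h}][\tilde{Q}(X_j = -1|X_i = 1) - \beta\alpha^{h}] \\
&\ge \tilde{Q}(1,-1)\tilde{Q}(-1,1) - 2\beta\alpha^{h}.
\end{aligned}$$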
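Using the above inequality, we have the following lower bound on the P-test quantity.

$$\begin{aligned}
\max_{x_i,x_j,x'_j} |P(x_i|x_j,x_S,x_T) - P(x_i|x'_j,x_S,x_T)|
&\ge |Q(x_i = 1|x_j = 1) - Q(x_i = 1|x_j = -1)| \\
&= \left| \frac{Q(x_i = 1, x_j = 1)}{Q(x_j = 1)} - \frac{Q(x_i = 1, x_j = -1)}{Q(x_j = -1)} \right| \\
&= \frac{|Q(1,1)Q(-1,-1) - Q(1,-1)Q(-1,1)|}{Q(x_j = 1)Q(x_j = -1)} \\
&= \frac{e^{2J_{ij}}\tilde{Q}(1,1)\tilde{Q}(-1,-1) - e^{-2J_{ij}}\tilde{Q}(1,-1)\tilde{Q}(-1,1)}{(e^{J_{ij}}\tilde{Q}(1,1) + e^{-J_{ij}}\tilde{Q}(-1,1))(e^{-J_{ij}}\tilde{Q}(1,-1) + e^{J_{ij}}\tilde{Q}(-1,-1))} \\
&\ge e^{-2J_{ij}}\left[e^{2J_{ij}}\tilde{Q}(1,1)\tilde{Q}(-1,-1) - e^{-2J_{ij}}\tilde{Q}(1,-1)\tilde{Q}(-1,1)\right] \\
&\ge e^{-2J_{ij}}\left[e^{2J_{ij}}\left(\tilde{Q}(1,-1)\tilde{Q}(-1,1) - 2\beta\alpha^{h}\right) - e^{-2J_{ij}}\tilde{Q}(1,-1)\tilde{Q}(-1,1)\right] \\
&= (1 - e^{-4J_{ij}})\tilde{Q}(1,-1)\tilde{Q}(-1,1) - 2\beta\alpha^{h} \\
&\ge (1 - e^{-4J_{\min}})\tilde{Q}(1,-1)\tilde{Q}(-1,1) - 2\beta\alpha^{h}.
\end{aligned}$$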
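Let $\bar{Q}$ denote the joint probability when all the external field terms are removed from $\tilde{Q}$; i.e.,

$$\bar{Q}(X_W) \propto \tilde{Q}(X_W) e^{-h_W^T X_W}.$$

As there are at most $(D_1 + D_2)d$ edges between $R$ and $W$, we have $\|h_W\|_1 \le (D_1 + D_2)dJ_{\max}$. Hence, for any subset $U \subset W$ and value $x_U$,

$$\tilde{Q}(x_U) = \frac{\sum_{x_{W \setminus U}} \bar{Q}(x_W) e^{h_W^T x_W}}{\sum_{x_W} \bar{Q}(x_W) e^{h_W^T x_W}} \ge \frac{\bar{Q}(x_U) e^{-(D_1+D_2)dJ_{\max}}}{e^{(D_1+D_2)dJ_{\max}}} = e^{-2(D_1+D_2)dJ_{\max}} \bar{Q}(x_U).$$

Moreover, $\bar{Q}$ is zero-field by definition and again has the same correlation decay condition as $P$, hence

$$\bar{Q}(1,-1) + \bar{Q}(1,1) = \bar{Q}(X_i = 1) = \frac{1}{2}, \qquad \frac{\bar{Q}(1,1)}{\bar{Q}(1,-1)} \le e^{\beta\alpha^{h}},$$

which gives the lower bound $\bar{Q}(1,-1) \ge \frac{1}{2(1 + e^{\beta\alpha^{h}})}$. Therefore, we have

$$\tilde{Q}(1,-1) \ge \frac{e^{-2(D_1+D_2)dJ_{\max}}}{2(1 + e^{\beta\alpha^{h}})}.$$

The same lower bound applies for $\tilde{Q}(-1,1)$. Hence,

$$\begin{aligned}
\max_{x_i,x_j,x'_j} |P(x_i|x_j,x_S,x_T) - P(x_i|x'_j,x_S,x_T)|
&\ge \frac{(1 - e^{-4J_{\min}}) e^{-4(D_1+D_2)dJ_{\max}}}{4(1 + e^{\beta\alpha^{h}})^2} - 2\beta\alpha^{h} \\
&\ge \frac{(1 - e^{-4J_{\min}}) e^{-4(D_1+D_2)dJ_{\max}}}{36} - 2\beta\alpha^{h} \\
&\ge \varepsilon.
\end{aligned}$$

The second inequality uses the fact that $e^{\beta\alpha^{h}} \le 2$. The last inequality is by the choice of $h$.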

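Next consider $(i,j) \notin E$. By the choice of $h$, there exists $S \subset N_i$, $|S| \le D_1$ such that, when the set $S$ is removed from the graph, the distance from $i$ to $j$ is no less than $h$. Let $T$ be any set with $|T| \le D_2$. As there is no edge between $i,j$, the joint probabilities $Q$ and $\tilde{Q}$ are the same. Then $\forall x_i, x_j, x_S, x_T$,

$$\begin{aligned}
|P(x_i|x_j,x_S,x_T) - P(x_i|-x_j,x_S,x_T)| &= |Q(x_i|x_j) - Q(x_i|-x_j)| \\
&= \frac{|Q(x_i,x_j)Q(-x_i,-x_j) - Q(x_i,-x_j)Q(-x_i,x_j)|}{Q(x_j)Q(-x_j)}.
\end{aligned}$$

Similar as above, we have

$$Q(x_j) \ge e^{-2(D_1+D_2)dJ_{\max}} \bar{Q}(x_j) = \frac{1}{2} e^{-2(D_1+D_2)dJ_{\max}}.$$

The same bound applies for $Q(-x_j)$. Therefore,

$$|P(x_i|x_j,x_S,x_T) - P(x_i|-x_j,x_S,x_T)| \le 4 e^{4(D_1+D_2)dJ_{\max}} |Q(x_i,x_j)Q(-x_i,-x_j) - Q(x_i,-x_j)Q(-x_i,x_j)|.$$

By correlation decay and the fact $\beta\alpha^{h} \le \ln 2 < 1$,

$$\begin{aligned}
Q(x_i,x_j)Q(-x_i,-x_j) &= Q(x_i|x_j)Q(x_j)Q(-x_i|-x_j)Q(-x_j) \\
&\le (Q(x_i|-x_j) + \beta\alpha^{h})Q(x_j)(Q(-x_i|x_j) + \beta\alpha^{h})Q(-x_j) \\
&\le Q(x_i,-x_j)Q(-x_i,x_j) + 3\beta\alpha^{h}.
\end{aligned}$$

Similarly, we have $Q(x_i,x_j)Q(-x_i,-x_j) \ge Q(x_i,-x_j)Q(-x_i,x_j) - 2\beta\alpha^{h}$. Hence, by the choice of $h$,

$$|P(x_i|x_j,x_S,x_T) - P(x_i|-x_j,x_S,x_T)| \le 12 e^{4(D_1+D_2)dJ_{\max}} \beta\alpha^{h} \le 12 A e^{4(D_1+D_2)dJ_{\max}} = \frac{\varepsilon}{4}.$$

□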



Now we specialize this lemma for large girth graphs, in which there is at most one short path between non-neighbor nodes and no short non-direct path between neighboring nodes. Setting $D_1 = 1$ and $D_2 = 0$ in the lemma, we get Theorem 3.5. For the lower bound on the correlation between neighbor nodes, we set $D_1 = D_2 = 0$ in the lemma and get Lemma 5.3.



APPENDIX B: FERROMAGNETIC ISING MODELS 

B.1. Proof of Corollary 6.3. By Proposition 6.2, we apply Definition 6.1 to $X$ with $f(X) = X_i$ and $g(X) = X_j$, and get $E[X_iX_j] \ge E[X_i]E[X_j]$. As there is no external field, $P(X_i = 1) = P(X_i = -1) = \frac{1}{2}$ for any $i$ and $P(X_i = x_i, X_j = x_j) = P(X_i = -x_i, X_j = -x_j)$ for any $i,j$. Therefore, $E[X_i] = 0$ and

$$E[X_iX_j] = 4[P(X_i = 1, X_j = 1) - P(X_i = 1, X_j = -1)][P(X_i = 1, X_j = 1) + P(X_i = 1, X_j = -1)].$$

By the above inequality, noticing that $P(X_i = 1, X_j = 1) + P(X_i = 1, X_j = -1) = \frac{1}{2}$, we get the result.

B.2. Proof of Lemma 6.4. For any $i \in V$, $j \in N_i$, $S \subset V$, $Q, \tilde{Q}, \bar{Q}$ are defined as in the proof of Lemma A.1. When $X$ is ferromagnetic but with external field, as in Corollary 6.3, we can show that

$$P(X_i = 1, X_j = 1)P(X_i = -1, X_j = -1) \ge P(X_i = 1, X_j = -1)P(X_i = -1, X_j = 1)$$

for any $i,j$. Therefore, we have

$$\begin{aligned}
\max_{x_i,x_j,x'_j} |P(x_i|x_j,x_S) - P(x_i|x'_j,x_S)|
&\ge e^{-2J_{ij}}\left(e^{2J_{ij}}\tilde{Q}(1,1)\tilde{Q}(-1,-1) - e^{-2J_{ij}}\tilde{Q}(1,-1)\tilde{Q}(-1,1)\right) \\
&\ge e^{-2J_{ij}}(e^{2J_{ij}} - e^{-2J_{ij}})\tilde{Q}(1,1)\tilde{Q}(-1,-1) \\
&\ge (1 - e^{-4J_{\min}})\tilde{Q}(1,1)\tilde{Q}(-1,-1).
\end{aligned}$$

We note that $\bar{Q}$ is zero field, so by Corollary 6.3 we get $\bar{Q}(1,1) = \bar{Q}(-1,-1) \ge \frac{1}{4}$. As shown in Lemma A.1,

$$\tilde{Q}(1,1) \ge e^{-2|N_S|J_{\max}} \bar{Q}(1,1) \ge \frac{1}{4} e^{-2|N_S|J_{\max}}.$$

The same lower bound can be obtained for $\tilde{Q}(-1,-1)$. Plugging the lower bounds into the above inequality, we get the result.

APPENDIX C: RANDOM GRAPHS 

The proofs in this section are related to the techniques developed in [2, 3]. 
The key differences are in adapting the proofs for general Ising models, as 
opposed to ferromagnetic models. We point out similarities and differences 
as we proceed with the section. 

C.1. Self-Avoiding-Walk Tree and Some Basic Results. This subsection introduces the notion of a self-avoiding-walk (SAW) tree, first introduced in [20], and presents some properties of a SAW tree. For an Ising model on a graph $G$, fix an ordering of all the nodes. We say edge $(i,j)$ is larger (smaller resp.) than $(i,l)$ with respect to node $i$ if $j$ comes after (before resp.) $l$ in the ordering. The SAW tree rooted at node $i$ is denoted as $T_{saw}(i; G)$. It is essentially the tree of self-avoiding walks originating from node $i$ except that the terminal nodes closing a cycle are also included in the tree with a fixed value $+1$ or $-1$. In particular, a terminal node is fixed to $+1$ (resp. $-1$) if the closing edge of the cycle is larger (resp. smaller) than the starting edge with respect to the terminal node. Let $A$ denote the set of all terminal nodes in $T_{saw}(i; G)$ and $x_A$ denote the fixed configuration on $A$. For set $S \subset V$, let $U(S)$ denote the set of all non-terminal copies of nodes in $S$ in $T_{saw}(i; G)$. Notice that there is a natural way to define conditioning on $T_{saw}(i; G)$ according to the conditioning on $G$; specifically, if node $j$ in graph $G$ is fixed to a certain value, the non-terminal copies of $j$ in tree $T_{saw}(i; G)$ are fixed to the same value.

One important result, [10, Theorem 7], motivated by [20], says that the conditional probability of node $i$ on graph $G$ is the same as the corresponding conditional probability of node $i$ on tree $T_{saw}(i; G)$, which is easier to deal with.

Proposition C.1. Let $S$ be a subset of $V$. $\forall x_i, x_S$, $P(x_i|x_S; G) = P(x_i|x_{U(S)}, x_A; T_{saw}(i; G))$.
Next we list some basic results which will be used in later proofs. First we have the following lemma about the number of short paths between a pair of nodes from [2]. The second part of Theorem 3.6 is an immediate result of this lemma.

Lemma C.2. [2] For all $i,j \in V$, the number of paths shorter than $\gamma_p$ between nodes $i,j$ is at most 2 almost always.

Let $B(i,l;T_{saw}(i;G))$ be the set of nodes of distance $l$ from $i$ on the tree $T_{saw}(i;G)$. Recall that $A$ is the set of terminal nodes in the tree. Let $\tilde{A}$ be the subset of $A$ that are of distance at most $\gamma_p$ from $i$. The sizes of $B(i,l;T_{saw}(i;G))$ and $\tilde{A}$ are upper bounded as follows.

Lemma C.3. [13, Lemma 2.2] For $1 \le l \le a\log p$, where $0 < a < \frac{1}{2\log c}$, we have

$$\max_i |B(i,l;T_{saw}(i;G))| = O(c^l \log p), \quad \text{almost always.}$$

Lemma C.4. $\forall i \in V$, $|\tilde{A}| \le 1$ in $T_{saw}(i;G)$ almost always.

Proof. Each terminal node in $\tilde{A}$ corresponds to a cycle connected to $i$ with the total length of the cycle and the path to $i$ at most $\gamma_p$. Let $OLO_l$ denote the subgraph consisting of two connected circles with total length $l$. This structure has $l - 1$ nodes and $l$ edges. Let $H = \{OLO_l, l \le 2\gamma_p\}$ and $N_H$ denote the number of subgraphs containing an instance from $H$. Then it is equivalent to show that there is at most 1 such small cycle close to each node, or $N_H = 0$ almost always.

/=i ^ ^ p 1=1 

=0{p~^-fpC^^'') < Oip~h = o(l). 
So, $P(N_H \ge 1) = o(1)$. □

C.2. Correlation Decay in Random Graphs. This subsection is to 
prove the first part of Theorem 3.6 which characterizes the correlation decay 
property of a random graph. 

First we state a correlation decay property for tree graphs. This result 
shows that having external fields only makes the correlation decay faster. 

Lemma C.5. Let $P$ be a general Ising model with external fields on a
tree $T$. Assume $|J_{ij}| \le J_{\max}$. $\forall i, j \in T$,

\[ |P(x_i|x_j) - P(x_i|x'_j)| \le (\tanh J_{\max})^{d(i,j)}. \]

Proof. The basic idea in the proof is to get an upper bound that does
not depend on the external field. To do this, we proceed as in the proof of
Lemma 4.1 in [5]. First, as noted in [5], w.l.o.g. assume the tree is a line
from $i$ to $j$. Then, we prove the result by induction on the number of hops
in the line.

1. $d(i,j) = 1$ or $j \in N_i$. The graph has only two nodes. We have

\[ P(x_i|x_j) = \frac{e^{J_{ij} x_i x_j + h_i x_i}}{e^{J_{ij} x_j + h_i} + e^{-J_{ij} x_j - h_i}}. \]

Hence, for $x_j \ne x'_j$,

\[ |P(x_i|x_j) - P(x_i|x'_j)| = \frac{|e^{2J_{ij}} - e^{-2J_{ij}}|}{e^{2J_{ij}} + e^{-2J_{ij}} + e^{2h_i} + e^{-2h_i}}. \]

This function is even in both $J_{ij}$ and $h_i$. Without loss of generality,
assume $J_{ij} \ge 0$, $h_i \ge 0$. It is not hard to see that the RHS is maximized
when $h_i = 0$. So

\[ |P(x_i|x_j) - P(x_i|x'_j)| \le \tanh|J_{ij}| \le \tanh J_{\max}. \]

The inequality suggests that, when there is an external field, the impact
of one node on the other is reduced.
2. Assume the claim is true for $d(i,j) \le k$. For $d(i,j) = k + 1$, pick any $l$
on the path from $i$ to $j$, and note that $X_i - X_l - X_j$ forms a Markov
chain. Moreover, $d(i,l) \le k$ and $d(l,j) \le k$.

\[ |P(x_i|x_j) - P(x_i|x'_j)| \]
\[ = \Big|\sum_{x_l} P(x_i|x_l)P(x_l|x_j) - \sum_{x_l} P(x_i|x_l)P(x_l|x'_j)\Big| \]
\[ = |P(x_i|x_l)(P(x_l|x_j) - P(x_l|x'_j)) + P(x_i|x'_l)(P(x'_l|x_j) - P(x'_l|x'_j))| \]
\[ = |(P(x_i|x_l) - P(x_i|x'_l))(P(x_l|x_j) - P(x_l|x'_j))| \]
\[ \le (\tanh J_{\max})^{d(i,l)} (\tanh J_{\max})^{d(l,j)} = (\tanh J_{\max})^{d(i,j)}. \]

The third equality follows by observing that $P(x_l|x_j) - P(x_l|x'_j) =
-(P(x'_l|x_j) - P(x'_l|x'_j))$. The last inequality is by induction.

□

Writing the conditional probability on a graph as a conditional probability
on the corresponding SAW tree, we can apply the above lemma and show
the correlation decay property for random graphs.
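
The tree bound of Lemma C.5 can be checked numerically on a small chain. The sketch below is only an illustration under assumed parameters: the chain length, the coupling range, the field range and all function names (`chain_conditional`, `max_influence`, `check_chain`) are hypothetical choices, not part of the paper. It enumerates all spin configurations of an Ising chain with random couplings and random external fields and compares the influence of node $j$ on node $i$ with $(\tanh J_{\max})^{d(i,j)}$.

```python
import itertools
import math
import random


def chain_conditional(J, h, i, j, xi, xj):
    """P(x_i = xi | x_j = xj) on an Ising chain, by brute-force enumeration.

    J[k] couples nodes k and k+1; h[k] is the external field on node k.
    """
    n = len(h)
    num = den = 0.0
    for x in itertools.product((-1, 1), repeat=n):
        if x[j] != xj:
            continue
        energy = sum(J[k] * x[k] * x[k + 1] for k in range(n - 1))
        energy += sum(h[k] * x[k] for k in range(n))
        w = math.exp(energy)
        den += w
        if x[i] == xi:
            num += w
    return num / den


def max_influence(J, h, i, j):
    """max over x_i of |P(x_i | x_j = +1) - P(x_i | x_j = -1)|."""
    return max(
        abs(chain_conditional(J, h, i, j, xi, 1)
            - chain_conditional(J, h, i, j, xi, -1))
        for xi in (-1, 1)
    )


def check_chain(n=5, j_max=0.8, trials=20, seed=0):
    """Return True if the bound of Lemma C.5 holds on random chains."""
    rng = random.Random(seed)
    for _ in range(trials):
        J = [rng.uniform(-j_max, j_max) for _ in range(n - 1)]
        h = [rng.uniform(-2.0, 2.0) for _ in range(n)]
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                bound = math.tanh(j_max) ** abs(i - j)
                if max_influence(J, h, i, j) > bound + 1e-12:
                    return False
    return True
```

With zero external field and two nodes the bound is attained with equality, which matches the first step of the induction; a nonzero field on node $i$ makes the influence strictly smaller.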

Lemma C.6. Let $P$ be a general Ising model on a graph $G$. Fix $i \in V$.
$\forall j \notin N_i$, let $S$ be the set that separates the paths shorter than $\gamma$ between $i, j$
and $B = B(i, \gamma; T_{saw}(i; G))$, then $\forall x_i, x_j, x'_j, x_S$,

\[ |P(x_i|x_j, x_S) - P(x_i|x'_j, x_S)| \le |B| (\tanh J_{\max})^{\gamma}. \]



Proof. Let $Z$ be the subset of $U(j)$ on $T_{saw}(i; G)$ that is not separated
by $U(S)$ from $i$. By the definition of $S$, $Z$ is of distance at least $\gamma$ from $i$. So
the $\gamma$-sphere $B$ separates $Z$ and $i$.

\[ |P(x_i|x_j, x_S) - P(x_i|x'_j, x_S)| \]
\[ \overset{(a)}{=} |P(x_i|x_{U(j)}, x_{U(S)}, x_A; T_{saw}(i; G)) - P(x_i|x'_{U(j)}, x_{U(S)}, x_A; T_{saw}(i; G))| \]
\[ \overset{(b)}{=} |P(x_i|x_Z, x_{U(S)}, x_A; T_{saw}(i; G)) - P(x_i|x'_Z, x_{U(S)}, x_A; T_{saw}(i; G))| \]
\[ \overset{(c)}{=} \Big|\sum_{x_B} P(x_i|x_B, x_{U(S)}, x_A; T_{saw}(i; G)) P(x_B|x_Z, x_{U(S)}, x_A; T_{saw}(i; G)) \]
\[ \quad - \sum_{x_B} P(x_i|x_B, x_{U(S)}, x_A; T_{saw}(i; G)) P(x_B|x'_Z, x_{U(S)}, x_A; T_{saw}(i; G))\Big| \]
\[ \le \max_{x_B} P(x_i|x_B, x_{U(S)}, x_A; T_{saw}(i; G)) - \min_{x_B} P(x_i|x_B, x_{U(S)}, x_A; T_{saw}(i; G)) \]
\[ \overset{(d)}{=} P(x_i|x_B^{\max}, x_{U(S)}, x_A; T_{saw}(i; G)) - P(x_i|x_B^{\min}, x_{U(S)}, x_A; T_{saw}(i; G)) \]
\[ \overset{(e)}{\le} |B| (\tanh J_{\max})^{\gamma}. \]

In the above, (a) follows from the property of the SAW tree in Proposition C.1. Step
(b) is by the choice of $S$ and the definition of $Z$. Step (c) uses the fact that
$Z$ is separated from $i$ by $B$. In (d), $x_B^{\max}, x_B^{\min}$ represent the maximizer and
minimizer respectively. Step (e) is by telescoping the sign of $x_B$. Notice that
the Hamming distance between $x_B^{\max}, x_B^{\min}$ is at most $|B|$, and we can apply
the above lemma to each pair as the conditioning terms differ only on one
node. The above proof is similar to the proof of Lemma 3 in [2]. However,
in going from step (c) to step (d) above, it is important to note that our
proof holds for general Ising models, whereas the proof in [2] is specific to
ferromagnetic Ising models. □

Proof of Theorem 3.6. As in [2], setting $\gamma = \gamma_p$ in the above lemma
and noticing that

\[ |B(i, \gamma_p; T_{saw}(i; G))| = O(c^{\gamma_p} \log p), \]

we get

\[ |P(x_i|x_j, x_S) - P(x_i|x'_j, x_S)| \le O((c \tanh J_{\max})^{\gamma_p} \log p) = O\left(p^{-\frac{\log \alpha}{K \log c}} \log p\right) = o(p^{-\kappa}). \]

□

C.3. Asymptotic Lower Bound on $P(x_i|x_R)$ When $|R| \le 3$.

This subsection proves that $P(x_i|x_R)$ is lower bounded by some constant
when $|R| \le 3$. This result comes in handy when proving the other two
theorems. This result was conjectured to hold in [2] for ferromagnetic Ising
models on the random graph $\mathcal{G}(p, \frac{c}{p})$ without a proof. Here we prove that it
is also true for general Ising models on the random graph.

Lemma C.7. $\forall i \in V$, $\forall R \subset V$, $|R| \le 3$, there exists a constant $C$ such
that $\forall x_i, x_R$, $P(x_i|x_R) \ge C$ almost always.

The basic idea is that the conditional probability $P(x_i|x_R)$ is equal to
some conditional probability on a SAW tree, which in turn is viewed as 
some unconditional probability on the same tree with induced external fields. 
Then we apply a tree reduction to the SAW tree till only the root is left, and 
show that the induced external field on the root is bounded, which implies 
that the probability of the root taking +1 or —1 is bounded. 

On a tree graph, when calculating a probability which involves no nodes 
in a subtree, we can reduce the subtree by simply summing (marginalizing) 
over all the nodes in it. This reduction produces an Ising model on the
remaining part of the tree with the same $J_{ij}$ and $h_i$ except for the root of the
subtree, which would have an induced external field due to the reduction 
of the subtree. The probability we want to calculate remains unchanged on 
this new tree. Such induced external fields are bounded according to the 
following lemma. 

Lemma C.8. Consider a leaf node 2 and its parent node 1. The induced
external field $h'_1$ on node 1 due to summation over node 2 satisfies

\[ |h'_1| \le |h_2| \tanh |J_{12}|. \]

We first prove an inequality which is used in the proof of the above lemma. 




Lemma C.9. $\forall x \ge 0, y \ge 0$,

\[ e^{2x \tanh y} \ge \frac{e^{x+y} + e^{-x-y}}{e^{x-y} + e^{-x+y}}. \]

Proof. Let $u = \tanh y \in [0, 1)$, then $y = \frac{1}{2} \ln \frac{1+u}{1-u}$. The required result is
equivalent to showing that

\[ e^{2xu}[(1 + u)e^{-x} + (1 - u)e^{x}] \ge (1 + u)e^{x} + (1 - u)e^{-x}. \]

Define

\[ f_u(z) = (1 + u)e^{uz} + (1 - u)e^{(1+u)z} - (1 + u)e^{z} - (1 - u). \]

Clearly, $f_u(0) = 0$, and

\[ f'_u(z) = (1 + u)[ue^{uz} + (1 - u)e^{(1+u)z} - e^{z}]. \]

By the convexity of $e^z$, $ue^{uz} + (1 - u)e^{(1+u)z} \ge e^{z}$. Hence, $f'_u(z) \ge 0$, which
implies $f_u(z) \ge 0$ for $z \ge 0$. We finish the proof by noticing that the original inequality
is equivalent to $f_u(2x) \ge 0$. □
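
As a sanity check on Lemma C.9 and on the auxiliary function $f_u$ used in its proof, the following sketch evaluates both on a grid. The grid range, the step size and the names `lemma_c9_gap`, `f_u`, `grid` and `min_gap` are illustrative assumptions; the code is a numerical check, not a substitute for the proof.

```python
import math


def lemma_c9_gap(x, y):
    """LHS minus RHS of Lemma C.9.

    The RHS (e^{x+y} + e^{-x-y}) / (e^{x-y} + e^{-x+y}) equals
    cosh(x + y) / cosh(x - y).
    """
    lhs = math.exp(2.0 * x * math.tanh(y))
    rhs = math.cosh(x + y) / math.cosh(x - y)
    return lhs - rhs


def f_u(u, z):
    """Auxiliary function from the proof; the lemma is f_u(2x) >= 0."""
    return ((1 + u) * math.exp(u * z)
            + (1 - u) * math.exp((1 + u) * z)
            - (1 + u) * math.exp(z)
            - (1 - u))


# Grid over [0, 5] x [0, 5] with step 0.05 (an arbitrary choice).
grid = [0.05 * k for k in range(101)]
min_gap = min(lemma_c9_gap(x, y) for x in grid for y in grid)
```

The gap vanishes on the axes $x = 0$ and $y = 0$ and is positive elsewhere on the grid, up to floating-point rounding.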

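Proof of Lemma C.8. Summing over $x_2$,

\[ \sum_{x_2} e^{J_{12} x_1 x_2 + h_2 x_2} = e^{J_{12} x_1 + h_2} + e^{-J_{12} x_1 - h_2} \propto e^{h'_1 x_1}. \]

Comparing the ratio of $x_1 = \pm 1$, we get

\[ \frac{e^{J_{12} + h_2} + e^{-J_{12} - h_2}}{e^{-J_{12} + h_2} + e^{J_{12} - h_2}} = \frac{e^{h'_1}}{e^{-h'_1}}. \]

So

\[ |h'_1| = \frac{1}{2} \left| \log \frac{e^{J_{12} + h_2} + e^{-J_{12} - h_2}}{e^{-J_{12} + h_2} + e^{J_{12} - h_2}} \right| \le |h_2| \tanh |J_{12}|. \]

The last inequality follows from Lemma C.9. □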
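
The reduction described above can be sketched in a few lines: sum out a leaf, add the induced field to its parent, and repeat until only the root is left. The code below is a minimal illustration under assumed conventions. The dictionary layout (`children`, `J` keyed by parent-child pairs), the example tree and all function names are hypothetical. It compares the root marginal obtained by the reduction with brute-force enumeration.

```python
import itertools
import math


def induced_field(j12, h2):
    """Field h'_1 induced on the parent after summing out a leaf.

    Solves sum_{x2} exp(j12*x1*x2 + h2*x2) = C * exp(h1p * x1), using
    (e^{J+h} + e^{-J-h}) / (e^{-J+h} + e^{J-h}) = cosh(J+h) / cosh(J-h).
    """
    return 0.5 * math.log(math.cosh(j12 + h2) / math.cosh(j12 - h2))


def effective_field(children, J, h, node):
    """Own field of `node` plus the fields induced by all of its subtrees."""
    total = h[node]
    for c in children.get(node, ()):
        total += induced_field(J[(node, c)], effective_field(children, J, h, c))
    return total


def root_marginal_by_reduction(children, J, h, root):
    """P(x_root = +1) after reducing the whole tree to the root."""
    H = effective_field(children, J, h, root)
    return math.exp(H) / (math.exp(H) + math.exp(-H))


def root_marginal_brute_force(J, h, root):
    """P(x_root = +1) by enumerating every spin configuration."""
    n = len(h)
    num = den = 0.0
    for x in itertools.product((-1, 1), repeat=n):
        e = sum(c * x[a] * x[b] for (a, b), c in J.items())
        e += sum(h[k] * x[k] for k in range(n))
        w = math.exp(e)
        den += w
        if x[root] == 1:
            num += w
    return num / den
```

For example, on the tree with edges (0,1), (0,2), (1,3), the two marginals should agree up to rounding, and each call to `induced_field` respects the bound $|h'_1| \le |h_2| \tanh |J_{12}|$ of Lemma C.8.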

It is easy to see that $|h'_1| \le |h_2| \tanh J_{\max} \le |h_2|$. By induction, we can
bound the external field induced by the whole subtree.

Proof of Lemma C.7. First we have

\[ P(x_i|x_R) = P(x_i|x_{U(R)}, x_A; T_{saw}(i; G)) \]
\[ \ge \min_{x_B} P(x_i|x_B, x_{U(R)}, x_A; T_{saw}(i; G)) \]
\[ = P(x_i|x_B^{\min}, x_{U(R)}, x_A; T_{saw}(i; G)) = Q(x_i), \]

where $Q$ is the probability on the tree with external fields induced by
$x_B^{\min}, x_{U(R)}, x_A$. We only need to consider the external fields on the parent
nodes of $B, U(R), A$ as the conditional probability is on a tree. The nodes
affected by $B$ are all $\gamma_p$ away from $i$ and the total number of them is no larger
than $|B|$, which is bounded by Lemma C.3. The number of nodes affected
by $U(R), \tilde{A}$ is no larger than $|U(R)| + |\tilde{A}|$. By Lemma C.2 and Lemma C.4,
$|U(R)| \le 2|R|$ and $|\tilde{A}| \le 1$ almost always. Applying the reduction technique
to the tree till a single root node $i$, by Lemma C.8, we bound the induced
external field on $i$ as

\[ |h_i| \le [(\tanh J_{\max})^{\gamma_p} |B| + (|U(R)| + |\tilde{A}|)] J_{\max} \]
\[ \le O((c \tanh J_{\max})^{\gamma_p} \log p + 2|R| + 1) \]
\[ \le O(p^{-\kappa} + 7) = O(1). \]

So,

\[ Q(x_i) = \frac{e^{h_i x_i}}{e^{h_i} + e^{-h_i}} \ge \frac{e^{-|h_i|}}{e^{|h_i|} + e^{-|h_i|}} = \Omega(1). \]

When $p$ is large enough, there exists some constant $C$ such that $P(x_i|x_R) \ge C$. □

C.4. Proof of Theorem 3.7. Let $S$ be the set that separates all the
paths shorter than $\gamma_p$ between nodes $i, j$ with size $|S| \le 3$. It is straightforward
to show that $I(X_i; X_j|X_S) = o(p^{-2\kappa})$ in a manner similar to [2,
Lemma 5]. The only difference is that the correlation decay property in
Theorem 3.6 takes a different form, which is easier to apply, therefore the
proof there needs to be modified accordingly. We also note that the constant
$C$ in Lemma C.7 is referred to as $f_{\min}(S)$ in [2]. The details are omitted here.

C.5. Proof of Theorem 3.8. When $j$ is a neighbor of $i$, conditioned
on the approximate separator $T$, there is one copy of $j$ which is a child of
the root $i$ in the SAW tree and is the only copy that is within $\gamma_p$ from $i$. In
Theorem 3.8, we show that the effect of conditioning on $T$ is bounded and
this copy of $j$ has a nontrivial impact on $i$. With a little abuse of notation,
we use $j$ to denote this copy of $j$ in $T_{saw}(i; G)$. W.l.o.g. assume $J_{ij} > 0$. As
in Lemma C.6,

\[ \max |P(x_i|x_j, x_T) - P(x_i|x'_j, x_T)| \]
\[ = \max |P(x_i|x_{U(j)}, x_{U(T)}, x_A; T_{saw}(i; G)) - P(x_i|x'_{U(j)}, x_{U(T)}, x_A; T_{saw}(i; G))| \]
\[ = \max |P(x_i|x_Z, x_{U(T)}, x_A; T_{saw}(i; G)) - P(x_i|x'_Z, x_{U(T)}, x_A; T_{saw}(i; G))| \]
\[ = \max \Big|\sum_{x_B} P(x_i|x_j, x_B, x_{U(T)}, x_A; T_{saw}(i; G)) P(x_B|x_Z, x_{U(T)}, x_A; T_{saw}(i; G)) \]
\[ \quad - \sum_{x_B} P(x_i|x'_j, x_B, x_{U(T)}, x_A; T_{saw}(i; G)) P(x_B|x'_Z, x_{U(T)}, x_A; T_{saw}(i; G))\Big| \]
\[ \ge \min_{x_B} P(x_i = +1|x_j = +1, x_B, x_{U(T)}, x_A; T_{saw}(i; G)) \]
\[ \quad - \max_{x_B} P(x_i = +1|x_j = -1, x_B, x_{U(T)}, x_A; T_{saw}(i; G)) \]
\[ = P(x_i = +1|x_j = +1, x_B^{\min}, x_{U(T)}, x_A; T_{saw}(i; G)) \]
\[ \quad - P(x_i = +1|x_j = -1, x_B^{\max}, x_{U(T)}, x_A; T_{saw}(i; G)) \]
\[ = P(x_i = +1|x_j = +1, x_B^{\min}, x_{U(T)}, x_A; T_{saw}(i; G)) \]
\[ \quad - P(x_i = +1|x_j = -1, x_B^{\min}, x_{U(T)}, x_A; T_{saw}(i; G)) \]
\[ \quad + P(x_i = +1|x_j = -1, x_B^{\min}, x_{U(T)}, x_A; T_{saw}(i; G)) \]
\[ \quad - P(x_i = +1|x_j = -1, x_B^{\max}, x_{U(T)}, x_A; T_{saw}(i; G)) \]
\[ \ge Q(x_i = +1|x_j = +1) - Q(x_i = +1|x_j = -1) - |B| (\tanh J_{\max})^{\gamma_p}, \]

where $x_B^{\min}$ and $x_B^{\max}$ denote the minimizer and the maximizer in the
preceding step, and $Q$ is the probability measure on the reduced graph with
only nodes $i$ and $j$. We have

\[ Q(x_i = +1|x_j = +1) - Q(x_i = +1|x_j = -1) = \frac{e^{2J_{ij}} - e^{-2J_{ij}}}{e^{2J_{ij}} + e^{-2J_{ij}} + e^{2h_i} + e^{-2h_i}} = \Omega(e^{-2|h_i|}). \]

The external fields in $Q$ are induced by the conditioning on $B, U(T), A$.
As in the proof of Lemma C.7, we have $|h_i| \le O(1)$, so $Q(x_i = +1|x_j = +1)
- Q(x_i = +1|x_j = -1) = \Omega(1)$. Hence,

\[ \max |P(x_i|x_j, x_T) - P(x_i|x'_j, x_T)| \ge \Omega(1) - O(p^{-\kappa}) = \Omega(1). \]

Using this result, the lower bound $I(X_i; X_j|X_T) = \Omega(1)$ simply follows
from the proof of [2, Lemma 7]. Again we note that the constant $C$ in
Lemma C.7 is referred to as $f_{\min}(T)$ in [2]. The details are omitted here.

C.6. Proof of Theorem 6.6. The proof of the theorem needs the
following lemma.

Lemma C.10. Let $X$ be a ferromagnetic Ising model (possibly with external
fields). $\forall i \in V$, $\forall S \subset V \setminus i$,

\[ P(x_i = +1|x_S = +1) \ge P(x_i = +1|x_S = -1). \]

Proof. For any node $j \in S$, let the probability $\tilde{P}(x_i, x_j) = P(x_i, x_j|x_{S \setminus j})$.
The probability $\tilde{P}$ is still ferromagnetic and hence is associated. Then we
have

\[ \tilde{P}(x_i = +1, x_j = +1) \tilde{P}(x_i = -1, x_j = -1) \ge \tilde{P}(x_i = +1, x_j = -1) \tilde{P}(x_i = -1, x_j = +1). \]

After some algebraic manipulation, we get

\[ \tilde{P}(x_i = +1|x_j = +1) \ge \tilde{P}(x_i = +1|x_j = -1). \]

This is equivalent to saying that

\[ P(x_i = +1|x_j = +1, x_{S \setminus j} = +1) \ge P(x_i = +1|x_j = -1, x_{S \setminus j} = +1). \]

So flipping one node from $+1$ to $-1$ reduces the conditional probability
regardless of what values the rest of the nodes take. Continuing this process
till we flip all the nodes in $S$, we get the result

\[ P(x_i = +1|x_S = +1) \ge P(x_i = +1|x_S = -1). \]

□
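
Lemma C.10 can be checked by exhaustive enumeration on a small model. The sketch below is illustrative only: the model size, the parameter ranges and the names `cond_prob_plus` and `check_monotone` are assumptions made for this example. It draws random ferromagnetic couplings with random external fields and tests the monotonicity for every node $i$ and every nonempty $S \subset V \setminus i$.

```python
import itertools
import math
import random


def cond_prob_plus(J, h, i, S, s_val):
    """P(x_i = +1 | x_S = s_val on every node of S), by enumeration.

    J is a symmetric matrix of couplings (only a < b entries are used);
    h is the list of external fields.
    """
    n = len(h)
    num = den = 0.0
    for x in itertools.product((-1, 1), repeat=n):
        if any(x[k] != s_val for k in S):
            continue
        e = sum(J[a][b] * x[a] * x[b]
                for a in range(n) for b in range(a + 1, n))
        e += sum(h[k] * x[k] for k in range(n))
        w = math.exp(e)
        den += w
        if x[i] == 1:
            num += w
    return num / den


def check_monotone(n=5, trials=10, seed=1):
    """Return True if Lemma C.10 holds on random ferromagnetic models."""
    rng = random.Random(seed)
    for _ in range(trials):
        J = [[0.0] * n for _ in range(n)]
        for a in range(n):
            for b in range(a + 1, n):
                J[a][b] = J[b][a] = rng.uniform(0.0, 1.0)  # ferromagnetic
        h = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        for i in range(n):
            others = [k for k in range(n) if k != i]
            for r in range(1, n):
                for S in itertools.combinations(others, r):
                    up = cond_prob_plus(J, h, i, S, 1)
                    down = cond_prob_plus(J, h, i, S, -1)
                    if up < down - 1e-12:
                        return False
    return True
```

The ferromagnetic assumption matters: with a single negative coupling between two nodes and no field, the inequality is reversed.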

Proof of Theorem 6.6. For $(i, j) \in E$, assume $J_{ij} > 0$. Following the
proof of Theorem 3.8,

\[ \max |P(x_i|x_j, x_S) - P(x_i|x'_j, x_S)| \]
\[ = \max |P(x_i|x_{U(j)}, x_{U(S)}, x_A; T_{saw}(i; G)) - P(x_i|x'_{U(j)}, x_{U(S)}, x_A; T_{saw}(i; G))| \]
\[ \ge P(x_i = +1|x_{U(j)} = +1, x_B^{\min}, x_{U(S)}, x_A; T_{saw}(i; G)) \]
\[ \quad - P(x_i = +1|x_{U(j)} = -1, x_B^{\max}, x_{U(S)}, x_A; T_{saw}(i; G)). \]

The only difference here is that we might have more than one copy of $j$
in $U(j)$. Let $Z = U(j) \setminus j$. By the above lemma, we have

\[ \max |P(x_i|x_j, x_S) - P(x_i|x'_j, x_S)| \]
\[ \ge P(x_i = +1|x_j = +1, x_Z = +1, x_B^{\min}, x_{U(S)}, x_A; T_{saw}(i; G)) \]
\[ \quad - P(x_i = +1|x_j = -1, x_Z = +1, x_B^{\min}, x_{U(S)}, x_A; T_{saw}(i; G)) \]
\[ \quad + P(x_i = +1|x_j = -1, x_Z = -1, x_B^{\min}, x_{U(S)}, x_A; T_{saw}(i; G)) \]
\[ \quad - P(x_i = +1|x_j = -1, x_Z = -1, x_B^{\max}, x_{U(S)}, x_A; T_{saw}(i; G)) \]
\[ \ge Q(x_i = +1|x_j = +1) - Q(x_i = +1|x_j = -1) - |B| (\tanh J_{\max})^{\gamma_p}. \]

As the size of $Z$ is only a constant, by the same reasoning, we finish the
proof of the theorem. □

APPENDIX D: CONCENTRATION 

Before proving the concentration results in Lemma 4.3, we first present the
following lemma which upper bounds the difference between the entropies of
two distributions with their $l_1$-distance. Let $P$ and $Q$ be two probability mass
functions on a discrete, finite set $\mathcal{X}$, and $H(P)$ and $H(Q)$ be their entropies
respectively. The $l_1$ distance between $P$ and $Q$ is defined as

\[ \|P - Q\|_1 = \sum_{x \in \mathcal{X}} |P(x) - Q(x)|. \]

Lemma D.1. [7, Theorem 17.3.3] If $\|P - Q\|_1 \le \frac{1}{2}$, then

\[ |H(P) - H(Q)| \le -\|P - Q\|_1 \log \frac{\|P - Q\|_1}{|\mathcal{X}|}. \]

When $\|P - Q\|_1 \le \frac{1}{2}$, the RHS is increasing in $\|P - Q\|_1$.
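
The entropy bound of Lemma D.1 is easy to probe numerically. In the sketch below, the helper names (`entropy`, `l1_distance`, `entropy_gap_bound`, `check_entropy_bound`), the alphabet sizes and the way nearby distributions are generated are all assumptions made for illustration. Natural logarithms are used on both sides, so the inequality is unaffected by the choice of base.

```python
import math
import random


def entropy(p):
    """Shannon entropy (natural log) of a probability vector."""
    return -sum(q * math.log(q) for q in p if q > 0)


def l1_distance(p, q):
    """l1 distance between two probability vectors of equal length."""
    return sum(abs(a - b) for a, b in zip(p, q))


def entropy_gap_bound(dist, alphabet_size):
    """RHS of Lemma D.1: -d * log(d / |X|), with the value 0 at d = 0."""
    if dist == 0:
        return 0.0
    return -dist * math.log(dist / alphabet_size)


def random_pmf(rng, m):
    """A random probability vector of length m."""
    w = [rng.random() + 1e-9 for _ in range(m)]
    total = sum(w)
    return [v / total for v in w]


def check_entropy_bound(trials=2000, seed=2):
    """Return True if the bound holds on random pairs with l1 <= 1/2."""
    rng = random.Random(seed)
    for _ in range(trials):
        m = rng.randint(2, 6)
        p = random_pmf(rng, m)
        r = random_pmf(rng, m)
        eps = rng.uniform(0.0, 0.25)  # mixing keeps the l1 distance <= 1/2
        q = [(1 - eps) * a + eps * b for a, b in zip(p, r)]
        d = l1_distance(p, q)
        if d > 0.5:
            continue
        if abs(entropy(p) - entropy(q)) > entropy_gap_bound(d, m) + 1e-12:
            return False
    return True
```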



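Proof of Lemma 4.3. By definition, $\forall S \subset V$ and $\forall x_S$,
$|1_{\{X_S^{(i)} = x_S\}} - P(x_S)| \le 1$ and

\[ \hat{P}(x_S) = \frac{1}{n} \sum_{i=1}^{n} 1_{\{X_S^{(i)} = x_S\}}. \]

By the Hoeffding inequality,

\[ P\left(|\hat{P}(x_S) - P(x_S)| \ge \gamma\right) \le 2e^{-2n\gamma^2}. \]

1. By the union bound over all $S \subset V$ with $|S| \le 2$ and all $x_S$, the probability

\[ P\left(\exists S \subset V, |S| \le 2, \exists x_S, |\hat{P}(x_S) - P(x_S)| \ge \gamma\right) \]

is at most the number of such pairs $(S, x_S)$ times the Hoeffding bound for a
single pair. For our choice of $n$, $\forall i, j \in V$, $\forall x_i, x_j$,

\[ |\hat{P}(x_i, x_j) - P(x_i, x_j)| < \gamma, \quad |\hat{P}(x_i) - P(x_i)| < \gamma, \]

with probability approaching one (at a rate involving some constant $c_1$), which
gives $\hat{P}(x_j) \ge P(x_j) - \gamma \ge \delta - \gamma \ge \frac{\delta}{2}$ as $\gamma \le \frac{\delta}{2}$. Hence,

\[ |\hat{P}(x_i|x_j) - P(x_i|x_j)| = \frac{|\hat{P}(x_i, x_j) P(x_j) - P(x_i, x_j) \hat{P}(x_j)|}{\hat{P}(x_j) P(x_j)} \]
\[ \le \frac{\hat{P}(x_i, x_j) |P(x_j) - \hat{P}(x_j)|}{\hat{P}(x_j) P(x_j)} + \frac{\hat{P}(x_j) |\hat{P}(x_i, x_j) - P(x_i, x_j)|}{\hat{P}(x_j) P(x_j)} \]
\[ \le \frac{2\gamma}{\delta^2 / 2} = \frac{4\gamma}{\delta^2}. \]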

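2. By the union bound, we have

\[ P\big(\exists i \in V, \exists S \subset L_i, |S| \le D_1 + D_2 + 1, \exists x_S, \]
\[ \quad |\hat{P}(x_S) - P(x_S)| \ge \gamma, |\hat{P}(x_i, x_S) - P(x_i, x_S)| \ge \gamma\big) \]
\[ \le 2 p L^{D_1 + D_2 + 1} |\mathcal{X}|^{D_1 + D_2 + 2} \cdot 2e^{-2n\gamma^2} \]
\[ = 4 e^{-2n\gamma^2 + \log p + (D_1 + D_2 + 1) \log L + (D_1 + D_2 + 2) \log |\mathcal{X}|}. \]

For our choice of $n$, $\forall i \in V$, $\forall j \in L_i$, $\forall S \subset L_i$, $|S| \le D_1 + D_2$, $\forall x_i, x_j, x_S$,

\[ |\hat{P}(x_i, x_j, x_S) - P(x_i, x_j, x_S)| < \gamma, \quad |\hat{P}(x_j, x_S) - P(x_j, x_S)| < \gamma, \]

with probability approaching one (at a rate involving some constant $c_2$), which
gives $\hat{P}(x_j, x_S) \ge P(x_j, x_S) - \gamma \ge \frac{\delta}{2}$ as $\gamma \le \frac{\delta}{2}$. Hence,

\[ |\hat{P}(x_i|x_j, x_S) - P(x_i|x_j, x_S)| = \frac{|\hat{P}(x_i, x_j, x_S) P(x_j, x_S) - P(x_i, x_j, x_S) \hat{P}(x_j, x_S)|}{\hat{P}(x_j, x_S) P(x_j, x_S)} \]
\[ \le \frac{\hat{P}(x_i, x_j, x_S) |P(x_j, x_S) - \hat{P}(x_j, x_S)|}{\hat{P}(x_j, x_S) P(x_j, x_S)} + \frac{\hat{P}(x_j, x_S) |\hat{P}(x_i, x_j, x_S) - P(x_i, x_j, x_S)|}{\hat{P}(x_j, x_S) P(x_j, x_S)} \]
\[ \le \frac{4\gamma}{\delta^2}. \]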
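3. As in the previous case, for our choice of $n$, $\forall i, j \in V$, $\forall S \subset L_i$, $|S| \le
D_1 + D_2$, $\forall x_i, x_j, x_S$,

\[ |\hat{P}(x_i, x_j, x_S) - P(x_i, x_j, x_S)| < \gamma, \]
\[ |\hat{P}(x_j, x_S) - P(x_j, x_S)| < \gamma, \]
\[ |\hat{P}(x_S) - P(x_S)| < \gamma \]

with probability approaching one (at a rate involving some constant $c_3$). So we get

\[ \|\hat{P}(X_i, X_j, X_S) - P(X_i, X_j, X_S)\|_1 \le |\mathcal{X}|^{D_1 + D_2 + 2} \gamma \le \frac{1}{2}. \]

By Lemma D.1,

\[ |\hat{H}(X_i, X_j, X_S) - H(X_i, X_j, X_S)| \]
\[ \le -\|\hat{P}(X_i, X_j, X_S) - P(X_i, X_j, X_S)\|_1 \log \frac{\|\hat{P}(X_i, X_j, X_S) - P(X_i, X_j, X_S)\|_1}{|\mathcal{X}|^{D_1 + D_2 + 2}} \]
\[ \le -|\mathcal{X}|^{D_1 + D_2 + 2} \gamma \log \gamma = -2 |\mathcal{X}|^{D_1 + D_2 + 2} \gamma \log \sqrt{\gamma}. \]

The last inequality used the monotonicity in Lemma D.1; the resulting bound is
further controlled by the fact that $0 < -\sqrt{\gamma} \log \sqrt{\gamma} < 1$ for $0 <
\gamma < 1$. Similarly, we have the same upper bound for $|\hat{H}(X_i, X_S) -
H(X_i, X_S)|$, $|\hat{H}(X_j, X_S) - H(X_j, X_S)|$ and $|\hat{H}(X_S) - H(X_S)|$. We finish
the proof by noticing that

\[ I(X_i; X_j|X_S) = H(X_i, X_S) + H(X_j, X_S) - H(X_i, X_j, X_S) - H(X_S). \]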
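
The entropy decomposition of the conditional mutual information suggests a direct plug-in estimator: replace each entropy by the entropy of the empirical distribution. The sketch below illustrates that idea under assumed conventions. Samples are tuples of node values, and the names `empirical_entropy` and `empirical_cmi` are hypothetical. It is meant as a reading aid rather than the authors' implementation.

```python
import math
from collections import Counter


def empirical_entropy(samples, idx):
    """Entropy (natural log) of the empirical distribution of the
    coordinates listed in idx."""
    n = len(samples)
    counts = Counter(tuple(s[k] for k in idx) for s in samples)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def empirical_cmi(samples, i, j, S):
    """Plug-in estimate of I(X_i; X_j | X_S) via
    H(X_i, X_S) + H(X_j, X_S) - H(X_i, X_j, X_S) - H(X_S)."""
    S = tuple(S)
    return (empirical_entropy(samples, (i,) + S)
            + empirical_entropy(samples, (j,) + S)
            - empirical_entropy(samples, (i, j) + S)
            - empirical_entropy(samples, S))
```

For instance, if the first two coordinates always agree and the third is an independent fair coin, the estimate of $I(X_0; X_1|X_2)$ on the four equally weighted configurations is $\log 2$; if all three coordinates always agree, the estimate is zero.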

□